generalizes several methods, including Efron's bootstrap penalization and the
leave-one-out penalization recently proposed by Arlot (2008), to any
exchangeable weighted bootstrap resampling scheme. In the heteroscedastic
regression framework, assuming the models to have a particular structure,
these resampling penalties are proved to satisfy a non-asymptotic oracle
inequality with leading constant close to 1. In particular, they are
asymptotically optimal. Resampling penalties are used for defining an estimator
adapting simultaneously to the smoothness of the regression function and to
the heteroscedasticity of the noise. This is remarkable because resampling
penalties are general-purpose devices, which have not been built specifically
to handle heteroscedastic data. Hence, resampling penalties naturally adapt
to heteroscedasticity. A simulation study shows that resampling penalties
improve on V-fold cross-validation in terms of final prediction error, in
particular when the signal-to-noise ratio is not large.
1. Introduction

In the last decades, model selection has received much interest. When the final 
goal is prediction, model selection can be seen more generally as the question 
of choosing between the outcomes of several prediction algorithms. With such 
a general formulation, a natural and classical answer is the following. First, 
estimate the prediction error for each model or algorithm; second, select the 
model minimizing this criterion. Model selection procedures mainly differ on 
the way of estimating the prediction error. 

The empirical risk, also known as the apparent error or the resubstitution 
error, is a natural estimator of the prediction error. Nevertheless, minimizing 
the empirical risk can fail dramatically: the empirical risk is strongly biased for 
models involving a number of parameters growing with the sample size because 
the same data are used for building predictors and for comparing them. 

In order to correct this drawback, cross-validation methods have been
introduced [4, 65], relying on a data-splitting idea for estimating the prediction
error with much less bias. In particular, V-fold cross-validation (VFCV, [36]) is
a popular procedure in practice because it is both general and computationally
tractable. A large number of papers exist about the properties of cross-validation
methods, showing that they are efficient for a suitable choice of the way data are
split (or V for VFCV). Asymptotic optimality results for leave-one-out
cross-validation (that is the V = n case) in regression have been proved for instance
by Li [49] and by Shao [60]. However, when V is fixed, VFCV can be
asymptotically suboptimal, as shown by Arlot [9]. We refer to the latter paper for more
references on cross-validation methods, including the small number of available
non-asymptotic results.

Another way to correct the empirical risk for its bias is penalization. In short, 
penalization selects the model minimizing the sum of the empirical risk and of 
some measure of complexity¹ of the model (called penalty); see FPE [2], AIC
[3], Mallows' Cp or CL [51]. Model selection can target two different goals.
On the one hand, a procedure is efficient (or asymptotically optimal) when its 
quadratic risk is asymptotically equivalent to the risk of the oracle. On the other 
hand, a procedure is model consistent when it selects the smallest true model 
asymptotically with probability one. This paper deals with efficient procedures, 
without assuming the existence of a true model. Therefore, the ideal penalty for 
prediction is the difference between the prediction error (the "true risk") and the
empirical risk; penalties should be data-dependent estimates of the ideal penalty. 

Many penalties or complexity measures have been proposed. Consider for
instance regression and least-squares estimators on finite-dimensional vector
spaces (the models). When the design is fixed and the noise level is constant, equal
to $\sigma$, Mallows' Cp penalty [51] is equal to $2 n^{-1} \sigma^2 D$ for a model of dimension
$D$ and it can be modified according to the number of models [20, 58]. Mallows'

¹ Note that "complexity" here and in the following refers to the implicit modelization of
a model or an algorithm, such as the number of estimated parameters. "Complexity" does
not refer at all to the computational complexity of algorithms, which will always be called
"computational complexity" in the following.
Cp-like penalties satisfy some optimality properties [61, 49, 14, 21] but they
can fail when the data are heteroscedastic [7] because these penalties are linear 
functions of the dimension of the models. 
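
To make the linear dependence on the dimension concrete, here is a minimal sketch of Mallows' Cp penalty, twice the noise variance times D/n, and of the resulting selection rule. The function names and the dictionary of empirical risks are illustrative assumptions, not notation from the paper, and the noise variance is assumed to be known.

```python
def mallows_cp_penalty(sigma2: float, dim: int, n: int) -> float:
    """Mallows' Cp penalty 2 * sigma^2 * D / n for a model of dimension D."""
    if n <= 0 or dim < 0:
        raise ValueError("need n > 0 and dim >= 0")
    return 2.0 * sigma2 * dim / n


def select_by_cp(empirical_risks: dict, sigma2: float, n: int) -> int:
    """Return the dimension minimizing empirical risk + Cp penalty.

    `empirical_risks` (hypothetical input) maps a model dimension D
    to the empirical risk of the least-squares estimator on that model.
    """
    return min(
        empirical_risks,
        key=lambda d: empirical_risks[d] + mallows_cp_penalty(sigma2, d, n),
    )
```

Because the penalty depends on the model only through D, two models with the same dimension are penalized identically, whatever the local noise level on their bins. This is the feature that can fail under heteroscedasticity.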

In the binary supervised classification framework, several penalties have been
proposed. First, VC-dimension-based penalties have the drawback of being
independent of the underlying measure, so that they are adapted to the worst case.
Second, global Rademacher complexities [45, 17] (generalized by Fromont with
resampling ideas [33]) take into account the distribution of the data, but they
are still too large to achieve fast rates of estimation when the margin condition
[53] holds. Third, local Rademacher complexities [18, 46] are tighter estimates
of the ideal penalty, but their computational cost is heavy and they involve huge
(and sometimes unknown) constants. Therefore, easy-to-compute penalties that
can achieve fast rates are still needed.

All the above penalties have serious drawbacks making them less often used in 
practice than cross-validation methods: AIC and Mallows' Cp rely on strong
assumptions (such as homoscedasticity of the data and linearity of the models) and
some mainly asymptotic arguments; VC-dimension-based penalties and global 
Rademacher complexities are far too pessimistic; local Rademacher complexities 
are computationally intractable, and their calibration is a serious issue. Another 
approach for designing penalties in the general framework may not suffer from 
these drawbacks: the resampling idea. 

Efron's resampling heuristics [29] was first stated for the bootstrap, then
generalized to the exchangeable weighted bootstrap by Mason and Newton [54] and
by Praestgaard and Wellner [57]. In short, according to the resampling heuristics,
the distribution of any function of the (unknown) distribution of the data and
the sample can be estimated by drawing "resamples" from the initial sample. In
particular, the resampling heuristics can be used to estimate the variance of an
estimator [29], a prediction error [67, 32] or the ideal penalty (using the
bootstrap [30, 31, 43], the M out of n bootstrap² [59] or a V-fold subsampling scheme
[9]). The asymptotic optimality of Efron's bootstrap penalty for selecting among
maximum likelihood estimators has been proved by Shibata [62]. Note also that
global and local Rademacher complexities use an i.i.d. Rademacher
resampling scheme for estimating different upper bounds on the ideal penalty and
Fromont's penalties [34] generalize the global Rademacher complexities to the
exchangeable weighted bootstrap.

The first goal of this paper is to define and study general-purpose penalties, 
that is penalties well-defined in almost every framework and performing rea- 
sonably well in most of them, including regression and classification. The main 
interest of such penalties would be the ability to solve difficult problems (for 
instance heteroscedastic data, a non-smooth regression function or the fact that 
the oracle model achieves fast rates of estimation) without knowing them in 
advance. From the practical point of view, such a property is crucial. 

To this aim, the resampling heuristics with the general exchangeable weighted 
bootstrap is used for estimating the ideal penalty (Section 2). This defines a 



² Shao's goal in [59] was not efficiency but model consistency.
wide family of model selection procedures, called "Resampling Penalization"
(RP), which includes Efron's and Shao's penalization methods [30, 59] as well
as the leave-one-out penalization defined in [9]. To our knowledge, it has never
been proposed with such general resampling schemes, so that the RP family 
contains a wide range of new procedures. Note that RP is well-defined in a 
general framework, including regression and classification, but also many other 
application fields (Section 7.2). Even if the main results are proved in the least- 
squares regression framework only, we obviously do not mean that RP should 
be restricted to this framework. 

In this paper, the model selection efficiency of RP is studied with a unified 
approach for all the exchangeable resampling schemes. Therefore, comparing 
bootstrap with subsampling is quite straightforward (Section 5) which is not 
common in the resampling literature (except a few asymptotic results, see Barbe 
and Bertail [15]). 

The point of view used in the paper is non-asymptotic, which has two major
implications. First, non-asymptotic results allow one to consider collections of
models depending on the sample size n: in practice, it is usual to increase the number
of explanatory variables with the number of observations. Considering models
with a large number of parameters (for instance of order $n^{\alpha}$ for some $\alpha > 0$)
is also particularly useful for designing adaptive estimators of a function which
is only assumed to belong to some Hölderian ball (see Section 3.2). Thus, the
non-asymptotic point of view makes it unnecessary to assume that the regression
function is described with a small number of parameters.

Second, several practical problems are "non-asymptotic" in the sense that 
the signal-to-noise ratio is low. As noticed in [9], with such data, VFCV can 
have serious drawbacks which can be naturally fixed by using the flexibility of 
penalization procedures. It is worth noting that such a non-asymptotic approach 
is not common in the model selection literature and few non-asymptotic results 
exist on general resampling methods. 

Another important point is that the framework of the paper includes several
kinds of heteroscedastic data. The observations $(X_i, Y_i)_{1 \le i \le n}$ are only assumed
to be i.i.d. with

$$ Y_i = s(X_i) + \sigma(X_i) \epsilon_i , $$

where $s : \mathcal{X} \to \mathbb{R}$ is the (unknown) regression function, $\sigma : \mathcal{X} \to \mathbb{R}$ is the
(unknown) noise level and $\epsilon_i$ has zero mean and unit variance conditionally
on $X_i$. In particular, the noise level $\sigma(X)$ can strongly depend on $X$ and the
distribution of $\epsilon_i$ can depend on $X_i$. Such data are generally considered as
difficult to handle because no information on $\sigma$ is known, making irregularities of
the signal difficult to distinguish from noise. As already mentioned, simple model
selection procedures such as Mallows' Cp can fail in this framework [7] whereas it
is natural to expect that resampling methods are robust to heteroscedasticity. In
this article, both theoretical and simulation results confirm this fact (Sections 3
and 5).

The two main results of the paper are stated in Section 3. First, making
mild assumptions on the distribution of the data, a non-asymptotic oracle
inequality for RP is proved with leading constant close to 1 (Theorem 1). It
holds for several kinds of resampling schemes (including bootstrap,
leave-one-out, half-subsampling and i.i.d. Rademacher weighted bootstrap) and implies
the asymptotic optimality of RP, even when the data are highly heteroscedastic.
For proving such a result, each model is assumed to be the vector space
of piecewise constant functions (histograms) on some partition of the feature
space. This is indeed a restriction, but we conjecture that it is mainly
technical and that RP remains efficient in a much more general framework (see
Section 7.2). Moreover, studying extensively the toy model of histograms allows
us to derive precise heuristics for the general framework. A major goal of the
paper is to help practitioners who would like to know how to use resampling for
performing model selection (see in particular Sections 6 and 7.3).

Second, RP is used to build an estimator simultaneously adaptive to the
smoothness of the regression function (assuming that $s$ is $\alpha$-Hölderian for some
unknown $\alpha \in (0,1]$) and to the unknown noise level $\sigma(\cdot)$ (Theorem 2). This
result may seem surprising since RP has never been designed specifically for such
a purpose. We interpret Theorem 2 as a confirmation that RP is naturally
adaptive and should work well in several other difficult frameworks.

Several results similar to Theorem 1 exist in the literature for other proce- 
dures such as Mallows' Cp (with homoscedastic data only), VFCV and leave-one- 
out cross-validation. Moreover, there exist several minimax adaptive estimators 
for heteroscedastic data with a smooth noise level, for instance [28, 35], and the
regression function and the noise level can be estimated simultaneously [37]. In 
comparison, the interest of RP is both its generality (contrary to Mallows' Cp 
and specific adaptive estimators) and its flexibility (contrary to VFCV, see [9]), 
as detailed in Section 7.1. 

A simulation study is conducted in Section 5 with small sample sizes. RP
is shown to be competitive with Mallows' Cp for "easy" problems, and much
better for some harder ones (for instance with a variable noise level). Moreover,
a well-calibrated RP yields almost always better model selection performance 
than VFCV. Therefore, RP can be of great interest in situations where no a 
priori information is known about the data. RP can deal with difficult problems, 
and compete with procedures that are fitted for easier problems. In short, RP 
is an efficient alternative to VFCV. 

This article is organized as follows. The framework and the Resampling Pe- 
nalization (RP) family of procedures are defined in Section 2. The main results 
are stated in Section 3. The differences between the resampling weights are 
investigated in Section 4. Then, a simulation study is presented in Section 5. 
Practical issues concerning the implementation of RP are considered in Sec- 
tion 6. RP is compared to other penalization methods in Section 7.1 and the 
extension of RP to the general framework is discussed in Section 7.2. Finally, 
Section 8 is devoted to the proofs. Some additional material (other simulation 
experiments and proofs) is available in a technical Appendix [8]. 
2. The Resampling Penalization procedure

In order to simplify the presentation, we choose to focus on the particular frame- 
work of least-squares regression on models of piecewise constant functions (his- 
tograms), which is the framework of the main results of Section 3 and the 
simulation study of Section 5. 

Nevertheless, the RP family is a general-purpose method which can easily be 
defined in the general prediction framework. The main interest of the histogram 
framework is to provide general heuristics about RP, so that the practitioner
can make the best possible use of RP in the general framework. A discussion on 
RP in the general prediction framework is provided in Section 7.2, including a 
general definition of RP. 

2.1. Framework

Suppose we observe some data $(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathcal{X} \times \mathbb{R}$, independent
with common distribution $P$, where the feature space $\mathcal{X}$ is typically a compact
set of $\mathbb{R}^k$. Let $s$ denote the regression function, that is $s(x) = \mathbb{E}[Y \mid X = x]$.
Then,

$$ Y_i = s(X_i) + \sigma(X_i) \epsilon_i \tag{1} $$

where $\sigma : \mathcal{X} \to \mathbb{R}$ is the heteroscedastic noise level and the $\epsilon_i$ are i.i.d. centered
noise terms; the $\epsilon_i$ possibly depend on $X_i$, but they have zero mean and
unit variance conditionally on $X_i$.

The goal is to predict $Y$ given $X$ where $(X, Y) \sim P$ is independent of the data.
The quality of a predictor $t : \mathcal{X} \to \mathbb{R}$ is measured by the quadratic prediction loss
$P\gamma(t) := \mathbb{E}_{(X,Y)}[\gamma(t, (X, Y))]$, where $(X, Y) \sim P$ and $\gamma(t, (x, y)) := (t(x) - y)^2$
is the least-squares contrast. Since $P\gamma(t)$ is minimal when $t = s$, the excess loss
is defined as

$$ \ell(s, t) := P\gamma(t) - P\gamma(s) = \mathbb{E}_{(X,Y)}\bigl[(t(X) - s(X))^2\bigr] . $$

Given a particular set of predictors $S_m$ (called a model), the best predictor over
$S_m$ is defined as

$$ s_m := \arg\min_{t \in S_m} \{ P\gamma(t) \} , $$

with its empirical counterpart

$$ \widehat{s}_m := \arg\min_{t \in S_m} \{ P_n\gamma(t) \} $$

(when it exists and is unique) where $P_n = n^{-1} \sum_{i=1}^{n} \delta_{(X_i, Y_i)}$ is the empirical
distribution. The estimator $\widehat{s}_m$ is the well-known empirical risk minimizer, also
called least-squares estimator since $\gamma$ is the least-squares contrast.

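In this article, we mainly consider histogram models $S_m$, that is of the
following form. Let $(I_\lambda)_{\lambda \in \Lambda_m}$ be some fixed partition of $\mathcal{X}$. Then, $S_m$ denotes the
set of functions $\mathcal{X} \to \mathbb{R}$ which are constant over $I_\lambda$ for every $\lambda \in \Lambda_m$; $S_m$ is a
vector space of dimension $D_m = \mathrm{Card}(\Lambda_m)$, spanned by the family $(\mathbf{1}_{I_\lambda})_{\lambda \in \Lambda_m}$.
The empirical risk minimizer over a histogram model $S_m$ is often called a
regressogram.

Explicit computations are easier with regressograms because $(\mathbf{1}_{I_\lambda})_{\lambda \in \Lambda_m}$ is an
orthogonal basis of $L^2(\mu)$ for any probability measure $\mu$ on $\mathcal{X}$. In particular,

$$ s_m = \sum_{\lambda \in \Lambda_m} \beta_\lambda \mathbf{1}_{I_\lambda} \quad \text{and} \quad \widehat{s}_m = \sum_{\lambda \in \Lambda_m} \widehat{\beta}_\lambda \mathbf{1}_{I_\lambda} , $$

where

$$ \beta_\lambda := \mathbb{E}_P[Y \mid X \in I_\lambda] , \qquad \widehat{\beta}_\lambda := \frac{1}{n \widehat{p}_\lambda} \sum_{X_i \in I_\lambda} Y_i \qquad \text{and} \qquad \widehat{p}_\lambda := P_n(X \in I_\lambda) . $$

Note that $\widehat{s}_m$ is uniquely defined if and only if each $I_\lambda$ contains at least one of
the $X_i$, that is $\min_{\lambda \in \Lambda_m} \widehat{p}_\lambda > 0$.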
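
The regressogram formulas above translate directly into code. The following is a minimal sketch for a one-dimensional feature space, assuming the partition is given by a sorted list of bin edges; the names `bin_index`, `regressogram` and `empirical_risk` are hypothetical helpers introduced here for illustration.

```python
from bisect import bisect_right


def bin_index(x: float, edges: list) -> int:
    """Index j of the bin [edges[j], edges[j+1]) containing x (last bin closed)."""
    j = bisect_right(edges, x) - 1
    return min(max(j, 0), len(edges) - 2)


def regressogram(xs, ys, edges):
    """Return (p_hat, beta_hat): empirical bin frequencies and bin-wise means of Y.

    beta_hat[j] is None when bin j is empty, in which case the
    least-squares estimator is not uniquely defined on that bin.
    """
    n, d = len(xs), len(edges) - 1
    counts, sums = [0] * d, [0.0] * d
    for x, y in zip(xs, ys):
        j = bin_index(x, edges)
        counts[j] += 1
        sums[j] += y
    p_hat = [c / n for c in counts]
    beta_hat = [s / c if c > 0 else None for s, c in zip(sums, counts)]
    return p_hat, beta_hat


def empirical_risk(xs, ys, edges, beta_hat):
    """Least-squares empirical risk of the regressogram (all bins non-empty)."""
    total = sum((beta_hat[bin_index(x, edges)] - y) ** 2 for x, y in zip(xs, ys))
    return total / len(xs)
```

For instance, with `xs = [0.1, 0.2, 0.6, 0.9]`, `ys = [1, 3, 2, 4]` and `edges = [0.0, 0.5, 1.0]`, each bin holds half of the points and the fitted values are the bin-wise averages 2 and 3.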

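Let us assume that a collection of models $(S_m)_{m \in \mathcal{M}_n}$ is given. Model selection
consists in selecting some data-dependent $\widehat{m} \in \mathcal{M}_n$ such that $\ell(s, \widehat{s}_{\widehat{m}})$ is as
small as possible. General penalization procedures can be described as follows.
Let $\operatorname{pen} : \mathcal{M}_n \to \mathbb{R}^+$ be some penalty function, possibly data-dependent, and
define

$$ \widehat{m} \in \arg\min_{m \in \mathcal{M}_n} \{ P_n\gamma(\widehat{s}_m) + \operatorname{pen}(m) \} . \tag{2} $$

Since the goal is to minimize the loss $P\gamma(\widehat{s}_{\widehat{m}})$, the ideal penalty is

$$ \operatorname{pen}_{\mathrm{id}}(m) := (P - P_n)\gamma(\widehat{s}_m) , \tag{3} $$

and we would like $\operatorname{pen}(m)$ to be as close to $\operatorname{pen}_{\mathrm{id}}(m)$ as possible for every
$m \in \mathcal{M}_n$. In the histogram framework, note that $\widehat{s}_m$ is not uniquely defined
when $\min_{\lambda \in \Lambda_m} \widehat{p}_\lambda = 0$; then, we consider that the model $S_m$ cannot be chosen,
which is formally equivalent to adding $+\infty \mathbf{1}_{\min_{\lambda \in \Lambda_m} \widehat{p}_\lambda = 0}$ to the penalty $\operatorname{pen}(m)$.

When $S_m$ is the histogram model associated with some partition $(I_\lambda)_{\lambda \in \Lambda_m}$
of $\mathcal{X}$, the ideal penalty (3) can be computed explicitly:

$$ \operatorname{pen}_{\mathrm{id}}(m) = (P - P_n)\gamma(s_m) + P\bigl(\gamma(\widehat{s}_m) - \gamma(s_m)\bigr) + P_n\bigl(\gamma(s_m) - \gamma(\widehat{s}_m)\bigr) $$

$$ \phantom{\operatorname{pen}_{\mathrm{id}}(m)} = (P - P_n)\gamma(s_m) + \sum_{\lambda \in \Lambda_m} \Bigl[ p_\lambda (\widehat{\beta}_\lambda - \beta_\lambda)^2 + \widehat{p}_\lambda (\widehat{\beta}_\lambda - \beta_\lambda)^2 \Bigr] \tag{4} $$

where $p_\lambda := P(X \in I_\lambda)$. The ideal penalty $\operatorname{pen}_{\mathrm{id}}(m)$ is unknown because it
depends on the true distribution $P$; therefore, resampling is a natural method
for estimating $\operatorname{pen}_{\mathrm{id}}(m)$.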
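
The passage from the first to the second line of (4) only uses the fact that the best predictor over the model and the regressogram are both piecewise constant, with bin values equal to the conditional mean and the empirical mean of Y respectively. A short verification, written as a sketch with the sample held fixed and the expectation taken over an independent pair (X, Y):

```latex
% Identity used: (a - y)^2 - (b - y)^2 = (a - b)^2 + 2 (a - b)(b - y).
\begin{align*}
P\bigl(\gamma(\widehat{s}_m) - \gamma(s_m)\bigr)
  &= \sum_{\lambda \in \Lambda_m} p_\lambda (\widehat{\beta}_\lambda - \beta_\lambda)^2
     + 2 \sum_{\lambda \in \Lambda_m} (\widehat{\beta}_\lambda - \beta_\lambda)\,
       \mathbb{E}\bigl[(\beta_\lambda - Y) \mathbf{1}_{X \in I_\lambda}\bigr] \\
  &= \sum_{\lambda \in \Lambda_m} p_\lambda (\widehat{\beta}_\lambda - \beta_\lambda)^2 ,
  \qquad \text{since } \mathbb{E}\bigl[Y \mathbf{1}_{X \in I_\lambda}\bigr] = p_\lambda \beta_\lambda , \\
P_n\bigl(\gamma(s_m) - \gamma(\widehat{s}_m)\bigr)
  &= \sum_{\lambda \in \Lambda_m} \widehat{p}_\lambda (\beta_\lambda - \widehat{\beta}_\lambda)^2
     + 2 \sum_{\lambda \in \Lambda_m} (\beta_\lambda - \widehat{\beta}_\lambda)\,
       P_n\bigl[(\widehat{\beta}_\lambda - Y) \mathbf{1}_{X \in I_\lambda}\bigr] \\
  &= \sum_{\lambda \in \Lambda_m} \widehat{p}_\lambda (\widehat{\beta}_\lambda - \beta_\lambda)^2 ,
  \qquad \text{since } P_n\bigl[Y \mathbf{1}_{X \in I_\lambda}\bigr] = \widehat{p}_\lambda \widehat{\beta}_\lambda .
\end{align*}
```

In both cases the cross term vanishes because the bin value being subtracted is exactly the mean of Y on that bin under the measure being integrated against.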



2.2. The resampling heuristics 

Let us recall briefly the resampling heuristics, which was introduced by
Efron [29] in the context of variance estimation. Basically, it says that one can
mimic the relationship between $P$ and $P_n$ by drawing an n-sample with common
distribution $P_n$, called the "resample"; let $P_n^W$ denote the empirical distribution
of the resample. Then, the conditional distribution of the pair $(P_n, P_n^W)$ given $P_n$
should be close to the distribution of the pair $(P, P_n)$. Hence, the expectation
of any quantity of the form $F(P, P_n)$ can be estimated by $\mathbb{E}_W[F(P_n, P_n^W)]$.
The expectation $\mathbb{E}_W[\cdot]$ means that we integrate with respect to the resampling
randomness only. Let us emphasize that $\operatorname{pen}_{\mathrm{id}}(m)$ has the form $F(P, P_n)$.
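
As a simple illustration of the heuristics (not an example taken from the paper), take F to be the squared difference between the mean under the first argument and the mean under the second. Its expectation under the true distribution is the variance of the sample mean, and the resampling estimate replaces the pair (true distribution, sample) by the pair (sample, resample). The sketch below approximates the expectation over the resampling randomness by Monte Carlo with Efron's bootstrap; the function name and defaults are assumptions.

```python
import random


def bootstrap_variance_of_mean(xs, n_resamples=2000, seed=0):
    """Monte Carlo estimate of E_W[(mean of resample - mean of sample)^2].

    This is the resampling counterpart of E[(sample mean - true mean)^2],
    using Efron's bootstrap: n points drawn with replacement from the sample.
    """
    rng = random.Random(seed)
    n = len(xs)
    sample_mean = sum(xs) / n
    total = 0.0
    for _ in range(n_resamples):
        resample = [xs[rng.randrange(n)] for _ in range(n)]
        total += (sum(resample) / n - sample_mean) ** 2
    return total / n_resamples
```

For the bootstrap this expectation has a closed form, the empirical variance (with normalization 1/n) divided by n, so the Monte Carlo output can be checked against it.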

Later on, this heuristics has been generalized to other resampling schemes,
with the exchangeable weighted bootstrap [54, 57]. The empirical distribution
of the resample then has the general form

$$ P_n^W := \frac{1}{n} \sum_{i=1}^{n} W_i \delta_{(X_i, Y_i)} $$

where $W \in \mathbb{R}^n$ is an exchangeable³ weight vector independent of the data and
such that $\forall i$, $\mathbb{E}_W[W_i] = 1$. In this article, $W$ is also assumed to satisfy $\forall i$,
$W_i \ge 0$ a.s. and $\mathbb{E}_W[W_i^2] < \infty$.

We mainly consider the following weights, which include the more classical 
resampling schemes: 

1. Efron(M), $M \in \mathbb{N} \setminus \{0\}$ (Efr): $((M/n) W_i)_{1 \le i \le n}$ is a multinomial vector
with parameters $(M; n^{-1}, \ldots, n^{-1})$. A classical choice is $M = n$.

2. Rademacher(p), $p \in (0; 1)$ (Rad): $(p W_i)$ are independent, with a Bernoulli(p)
distribution. A classical choice is $p = 1/2$.

3. Poisson($\mu$), $\mu \in (0, \infty)$ (Poi): $(\mu W_i)$ are independent, with a Poisson($\mu$)
distribution. A classical choice is $\mu = 1$.

4. Random hold-out(q), $q \in \{1, \ldots, n\}$ (Rho): $W_i = (n/q) \mathbf{1}_{i \in I}$ where $I$ is a
uniform random subset of cardinality $q$ of $\{1, \ldots, n\}$. A classical choice is
$q = n/2$.

5. Leave-one-out (Loo) = Rho(n − 1).

In the following, Efr, Rad, Poi, Rho and Loo respectively denote the above re- 
sampling weight vector distributions with the "classical" value of the parameter. 
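
The five weight distributions listed above can be sampled with the standard library alone. The sketch below is one possible implementation under that reading of the definitions; the function names are illustrative, and the Poisson sampler uses Knuth's multiplication method, which is an implementation choice made here and is only adequate for small values of the parameter.

```python
import math
import random


def efron_weights(n, m=None, rng=random):
    """Efron(M): (M/n) * W is multinomial(M; 1/n, ..., 1/n). Default M = n."""
    m = n if m is None else m
    counts = [0] * n
    for _ in range(m):
        counts[rng.randrange(n)] += 1
    return [c * n / m for c in counts]


def rademacher_weights(n, p=0.5, rng=random):
    """Rademacher(p): p * W_i are i.i.d. Bernoulli(p)."""
    return [(1.0 / p) if rng.random() < p else 0.0 for _ in range(n)]


def _poisson(mu, rng):
    # Knuth's multiplication method; adequate for small mu only.
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def poisson_weights(n, mu=1.0, rng=random):
    """Poisson(mu): mu * W_i are i.i.d. Poisson(mu)."""
    return [_poisson(mu, rng) / mu for _ in range(n)]


def random_holdout_weights(n, q=None, rng=random):
    """Rho(q): W_i = n/q on a uniform random subset of size q, else 0."""
    q = n // 2 if q is None else q
    kept = set(rng.sample(range(n), q))
    return [n / q if i in kept else 0.0 for i in range(n)]


def leave_one_out_weights(n, rng=random):
    """Loo = Rho(n - 1): exactly one weight is zero."""
    return random_holdout_weights(n, n - 1, rng)
```

Each sampler returns non-negative weights with expectation one, as required in Section 2.2. Passing a `random.Random(seed)` instance as `rng` makes the draws reproducible.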

Remark 1. The above terminology explicitly links the weight vector distribu- 
tions with some classical resampling schemes. See [54, 40, 66] for more details 
about classical resampling weight names, as well as other classical examples. 

• The name "Efron" comes from the classical choice M = n for which Efron 
weights actually are the bootstrap weights. When M < n, Efron(M) is 
the M out of n bootstrap, used for instance by Shao [59]. 

• The name "Rademacher" for the i.i.d. Bernoulli weights comes from the
classical choice $p = 1/2$ for which $(W_i - 1)_i$ are i.i.d. Rademacher random
variables. For instance, global and local Rademacher complexities use this
resampling scheme to estimate different upper bounds on $\operatorname{pen}_{\mathrm{id}}(m)$ (see
Section 7.2.4).

• Poisson weights are often used as approximations to Efron weights, via the
so-called "Poissonization" technique (see [66, Chapter 3.5] and [33]). They
are known to be efficient for estimating several non-smooth functionals (see
[15, Chapter 3] and [52, Section 1.4]).

• The Random hold-out(q) weights can also be called "delete-(n − q) jackknife",
and the Leave-one-out weights also correspond to the jackknife
(sometimes called cross-validation). They are both resampling schemes
without replacement [66, Example 3.6.14], more often called subsampling
weights (see for instance the book by Politis, Romano and Wolf [56] on
subsampling). They are close to the idea of splitting the data into a training
set and a validation set (for instance, leave-one-out, hold-out and
cross-validation). Indeed, if one defines the training set as

$$ \{ (X_i, Y_i) \text{ s.t. } W_i \neq 0 \} $$

and the validation set as its complement, there is a one-to-one
correspondence between subsampling weights and data splitting.

³ A weight vector W is said to be exchangeable when its distribution is invariant by any
permutation of its coordinates.



2.3. Resampling Penalization 

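Applying directly the resampling heuristics of Section 2.2 for estimating the
ideal penalty (3), we would get the penalty

$$ \mathbb{E}_W\bigl[ P_n\gamma(\widehat{s}_m^W) - P_n^W\gamma(\widehat{s}_m^W) \bigr] , \tag{5} $$

where

$$ \widehat{s}_m^W := \arg\min_{t \in S_m} P_n^W\gamma(t) = \sum_{\lambda \in \Lambda_m} \widehat{\beta}_\lambda^W \mathbf{1}_{I_\lambda} , \qquad \widehat{\beta}_\lambda^W := \frac{1}{n \widehat{p}_\lambda^W} \sum_{X_i \in I_\lambda} W_i Y_i , $$

$$ \widehat{p}_\lambda^W := P_n^W(X \in I_\lambda) = \widehat{p}_\lambda W_\lambda \qquad \text{and} \qquad W_\lambda := \frac{1}{n \widehat{p}_\lambda} \sum_{X_i \in I_\lambda} W_i . $$

Two problems have to be solved before defining properly the Resampling
Penalization procedure. Here, we focus on the histogram framework; the general
framework will be considered in Section 7.2.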

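First, (5) is not well-defined because $\widehat{s}_m^W$ is not unique if $\min_{\lambda \in \Lambda_m} \widehat{p}_\lambda^W = 0$.
Hence, even when $\min_{\lambda \in \Lambda_m} \widehat{p}_\lambda > 0$, the problem occurs as soon as $W_\lambda = 0$ for
some $\lambda \in \Lambda_m$, which has a positive probability (except when $D_m = 1$) for most
of the resampling schemes since $P_W(\forall i \ge 2, W_i = 0) > 0$. In order to make (5)
well-defined, let us rewrite the resampling penalty as the resampling estimate
of (4), that is

$$ \mathbb{E}_W\bigl[ P_n\gamma(\widehat{s}_m^W) - P_n^W\gamma(\widehat{s}_m^W) \bigr] = \widehat{p}_0(m) + \widehat{p}_1(m) + \widehat{p}_2(m) $$

where

$$ \widehat{p}_0(m) := \mathbb{E}_W\bigl[ (P_n - P_n^W)\gamma(\widehat{s}_m) \bigr] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_W[1 - W_i] \, \gamma(\widehat{s}_m, (X_i, Y_i)) = 0 $$

because $\mathbb{E}_W[W_i] = 1$ for every $i$,

$$ \widehat{p}_1(m) := \sum_{\lambda \in \Lambda_m} \mathbb{E}_W\Bigl[ \widehat{p}_\lambda \bigl( \widehat{\beta}_\lambda^W - \widehat{\beta}_\lambda \bigr)^2 \Bigr] \qquad \text{and} \qquad \widehat{p}_2(m) := \sum_{\lambda \in \Lambda_m} \mathbb{E}_W\Bigl[ \widehat{p}_\lambda^W \bigl( \widehat{\beta}_\lambda^W - \widehat{\beta}_\lambda \bigr)^2 \Bigr] . $$

With the convention $(\widehat{\beta}_\lambda^W - \widehat{\beta}_\lambda)^2 = 0$ when $\widehat{p}_\lambda^W = 0$, $\widehat{p}_2(m)$ is well-defined
since $\widehat{\beta}_\lambda^W$ is well-defined when $\widehat{p}_\lambda^W > 0$. It remains to define properly $\widehat{p}_1(m)$.
We suggest replacing the expectation over all the resampling weights by an
expectation conditional on $W_\lambda > 0$, separately for each $m \in \mathcal{M}_n$ and $\lambda \in \Lambda_m$,
which ensures that we only remove a small proportion of the possible resampling
weights. To summarize, (5) is replaced by

$$ \sum_{\lambda \in \Lambda_m} \Bigl( \mathbb{E}_W\Bigl[ \widehat{p}_\lambda \bigl( \widehat{\beta}_\lambda^W - \widehat{\beta}_\lambda \bigr)^2 \Bigm| W_\lambda > 0 \Bigr] + \mathbb{E}_W\Bigl[ \widehat{p}_\lambda^W \bigl( \widehat{\beta}_\lambda^W - \widehat{\beta}_\lambda \bigr)^2 \Bigr] \Bigr) . \tag{6} $$

Table 1
$C_W$ for several resampling schemes (see Section 3.4.1)

    D(W)    Efr(M)    Rad(p)       Poi(μ)    Rho(q)       Loo
    C_W     M/n       p/(1 − p)    μ         q/(n − q)    n − 1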



Second, (6) is strongly biased as an estimate of $\operatorname{pen}_{\mathrm{id}}$ when $\operatorname{var}(W_1)$ is small,
because $P_n^W$ is then much closer to $P_n$ than $P_n$ is close to $P$. Assuming the $S_m$
to be histogram models, we will prove in Section 3.4.1 (see Propositions 1 and 2)
that the bias can be corrected by multiplying (6) by a constant $C_W$ which only
depends on the distribution of $W$. The values of $C_W$ for the classical weights
are reported in Table 1. Remark that $C_W = 1$ in the bootstrap case (Efr), as
well as for Rad, Poi and Rho.
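
For use in code, Table 1 can be wrapped in a small lookup function. This is a sketch under two assumptions: the scheme labels are the abbreviations of Section 2.2, and the Poisson entry is taken equal to the Poisson parameter, which agrees with the statement that the constant equals 1 for the classical choice of parameter but is otherwise an assumption of this sketch.

```python
def c_w(scheme, n, param=None):
    """Constant C_W of Table 1 for a resampling scheme on n points.

    `param` is M for "Efr", p for "Rad", mu for "Poi", q for "Rho";
    when omitted, the classical value of the parameter is used.
    """
    if scheme == "Efr":
        m = n if param is None else param
        return m / n
    if scheme == "Rad":
        p = 0.5 if param is None else param
        return p / (1.0 - p)
    if scheme == "Poi":
        # Assumed entry: C_W = mu (equal to 1 for the classical mu = 1).
        return 1.0 if param is None else float(param)
    if scheme == "Rho":
        q = n / 2 if param is None else param
        return q / (n - q)
    if scheme == "Loo":
        return float(n - 1)
    raise ValueError(f"unknown resampling scheme: {scheme!r}")
```

With the classical parameters, `c_w` returns 1 for Efr, Rad, Poi and Rho, and n − 1 for Loo.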

We are now in position to define properly the Resampling Penalization (RP) 
procedure for selecting among histogram models. See Section 7.2 for the defini- 
tion of RP in the general framework (Procedure 3). 

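Procedure 1 (Resampling Penalization for histograms).

1. Replace $\mathcal{M}_n$ by

$$ \widehat{\mathcal{M}}_n = \Bigl\{ m \in \mathcal{M}_n \text{ s.t. } \min_{\lambda \in \Lambda_m} \{ n \widehat{p}_\lambda \} \ge 3 \Bigr\} . $$

2. Choose a resampling scheme $\mathcal{D}(W)$.

3. Choose a constant $C \ge C_W$ where $C_W$ is defined in Table 1.

4. Define, for each $m \in \widehat{\mathcal{M}}_n$, the resampling penalty $\operatorname{pen}(m)$ as

$$ \operatorname{pen}(m) := C \sum_{\lambda \in \Lambda_m} \Bigl( \mathbb{E}_W\Bigl[ \widehat{p}_\lambda \bigl( \widehat{\beta}_\lambda^W - \widehat{\beta}_\lambda \bigr)^2 \Bigm| W_\lambda > 0 \Bigr] + \mathbb{E}_W\Bigl[ \widehat{p}_\lambda^W \bigl( \widehat{\beta}_\lambda^W - \widehat{\beta}_\lambda \bigr)^2 \Bigr] \Bigr) . \tag{7} $$

5. Select $\widehat{m} \in \arg\min_{m \in \widehat{\mathcal{M}}_n} \{ P_n\gamma(\widehat{s}_m) + \operatorname{pen}(m) \}$.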
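
The five steps of Procedure 1 can be sketched in a few lines for a one-dimensional feature space. The code below is an illustration under explicit assumptions, not the paper's implementation: the expectations over the weights are approximated by Monte Carlo with a fixed number of resamples, the default scheme is Efron's bootstrap (for which the constant of Table 1 equals 1), partitions are given by sorted bin edges, and all function names are hypothetical.

```python
import random
from bisect import bisect_right


def efron_weights(n, rng):
    """Bootstrap weights: multinomial(n; 1/n, ..., 1/n) counts."""
    counts = [0.0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1.0
    return counts


def bin_of(x, edges):
    """Index of the bin of `edges` containing x (last bin closed on the right)."""
    return min(max(bisect_right(edges, x) - 1, 0), len(edges) - 2)


def resampling_penalty(xs, ys, edges, c=1.0, n_resamples=200, seed=0,
                       weights=efron_weights):
    """Monte Carlo approximation of penalty (7) for one histogram model.

    Returns (empirical_risk, penalty), or None when some bin holds fewer
    than 3 points (step 1 of the procedure).
    """
    rng = random.Random(seed)
    n, d = len(xs), len(edges) - 1
    bins = [bin_of(x, edges) for x in xs]
    counts = [bins.count(j) for j in range(d)]
    if min(counts) < 3:
        return None
    beta = [sum(y for y, b in zip(ys, bins) if b == j) / counts[j] for j in range(d)]
    risk = sum((beta[b] - y) ** 2 for y, b in zip(ys, bins)) / n

    cond_sum, cond_hits, plain_sum = [0.0] * d, [0] * d, [0.0] * d
    for _ in range(n_resamples):
        w = weights(n, rng)
        w_tot, wy_tot = [0.0] * d, [0.0] * d
        for wi, y, b in zip(w, ys, bins):
            w_tot[b] += wi
            wy_tot[b] += wi * y
        for j in range(d):
            if w_tot[j] > 0:  # resampled bin mean only defined when W_lambda > 0
                sq = (wy_tot[j] / w_tot[j] - beta[j]) ** 2
                cond_sum[j] += counts[j] / n * sq   # p_hat * squared gap
                cond_hits[j] += 1
                plain_sum[j] += w_tot[j] / n * sq   # p_hat^W * squared gap
    pen = sum(
        (cond_sum[j] / cond_hits[j] if cond_hits[j] else 0.0)
        + plain_sum[j] / n_resamples
        for j in range(d)
    )
    return risk, c * pen


def select_model(xs, ys, candidate_edges, **kwargs):
    """Step 5: index of the candidate minimizing empirical risk + penalty."""
    best, best_crit = None, float("inf")
    for k, edges in enumerate(candidate_edges):
        out = resampling_penalty(xs, ys, edges, **kwargs)
        if out is not None and sum(out) < best_crit:
            best, best_crit = k, sum(out)
    return best
```

The conditional expectation in the first term of (7) is approximated by averaging only over the resamples for which the bin receives positive total weight, while the second term is averaged over all resamples with the convention that it vanishes when the bin receives no weight. For weights with a small variance, the argument `c` should be set to the corresponding constant of Table 1 or larger.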

Remark 2. 1. At step 1, we remove more models than those for which $\widehat{s}_m$
is not uniquely defined. When $n \widehat{p}_\lambda = 1$ for some $\lambda \in \Lambda_m$, estimating the
quality of estimation of $\beta_\lambda$ with only one data point is hopeless with no
assumption on the noise level $\sigma$. The reason why we also remove models
for which $\min_{\lambda \in \Lambda_m} \{ n \widehat{p}_\lambda \} = 2$ is that the oracle inequalities of Section 3
require it for some of the weights; nevertheless, such models generally have
a poor prediction performance, so that step 1 is reasonable.

2. At step 3, $C$ can be larger than $C_W$ because overpenalizing can be fruitful
from the non-asymptotic point of view, in particular when the sample size
$n$ is small or the noise level $\sigma$ is large. The simulation study of Section 5
provides experimental evidence for this fact (see also Section 6.3.2).

3. RP (Procedure 1) generalizes several model selection procedures. With a
bootstrap resampling scheme (Efr) and $C = 1$, RP is Efron's bootstrap
penalization [30], which has also been called EIC in the log-likelihood
framework [43]. With an M out of n bootstrap resampling scheme (Efr(M))
and $C = 1$, RP has been proposed and studied by Shao [59] in the
context of model identification. Note that $C_W \neq 1$ for Efr(M) weights if
$M \neq n$; this crucial point will be discussed in Section 3.4.1. RP with a
(non-exchangeable) V-fold subsampling scheme has also been proposed
recently in [9].

4. When $W$ are the "leave-one-out" weights, RP is not the classical
leave-one-out model selection procedure. Nevertheless, according to [9], when
$C = n - 1$, it is identical to Burman's n-fold corrected cross-validation
[22], hence close to the uncorrected one.

3. Main results 

In this section, we state some non-asymptotic properties of Resampling Penaliza- 
tion (Procedure 1) for model selection. First, Theorem 1 is an oracle inequality 
with leading constant close to 1. In particular, Theorem 1 implies the asymptotic 
optimality of RP. Second, Theorem 2 is an adaptivity result for an estimator 
built upon RP, when the regression function belongs to some Hölderian ball.
A remarkable point is that both results remain valid under mild assumptions
on the distribution of the noise, which can be non-Gaussian and highly
heteroscedastic.

Throughout this section, we assume the existence of non-negative constants
$\alpha_{\mathcal{M}}$, $c_{\mathcal{M}}$, $c_{\mathrm{rich}}$ such that:

(P1) Polynomial size of $\mathcal{M}_n$: $\mathrm{Card}(\mathcal{M}_n) \le c_{\mathcal{M}} n^{\alpha_{\mathcal{M}}}$.
(P2) Richness of $\mathcal{M}_n$: $\exists m_0 \in \mathcal{M}_n$ s.t. $D_{m_0} \in [\sqrt{n}; c_{\mathrm{rich}} \sqrt{n}]$.
(P3) The weight vector $W$ is chosen among Efr, Rad, Poi, Rho and Loo (defined
in Section 2.2, with the classical value of their parameter).

(P1) is a natural restriction since RP plugs an estimator of the ideal penalty
into (2). When $\mathrm{Card}(\mathcal{M}_n)$ is larger, say proportional to $e^{a n}$ for some $a > 0$,
Birgé and Massart [21] proved that penalties estimating the ideal penalty cannot
be asymptotically optimal. (P2) is merely technical. (P3) can be relaxed, as
explained in Section 4.2.

3.1. Oracle inequality 

Theorem 1. Assume that the data $(X_i, Y_i)_{1 \le i \le n}$ satisfy the following:

(Ab) Bounded data: $\| Y_i \|_\infty \le A < \infty$.

(An) Noise level bounded from below: $\sigma(X_i) \ge \sigma_{\min} > 0$ a.s.

(Ap) Polynomially decreasing bias: there exist $\beta_1 \ge \beta_2 > 0$ and $C_b^+, C_b^- > 0$
such that

$$ \forall m \in \mathcal{M}_n , \quad C_b^- D_m^{-\beta_1} \le \ell(s, s_m) \le C_b^+ D_m^{-\beta_2} . $$

($\mathrm{Ar}_\ell^X$) Lower regularity of the partitions for $\mathcal{D}(X)$: there exists $c_{r,\ell}^X > 0$ such
that

$$ \forall m \in \mathcal{M}_n , \quad D_m \min_{\lambda \in \Lambda_m} p_\lambda \ge c_{r,\ell}^X . $$

Let $\widehat{m}$ be defined by Procedure 1 (under restrictions (P1–3), with $C = C_W$).
Then, there exist a constant $K_1 > 0$ and an absolute sequence $\epsilon_n$ converging to
zero at infinity such that, with probability at least $1 - K_1 n^{-2}$,

$$ \ell(s, \widehat{s}_{\widehat{m}}) \le (1 + \epsilon_n) \inf_{m \in \mathcal{M}_n} \{ \ell(s, \widehat{s}_m) \} . \tag{8} $$

Moreover,

$$ \mathbb{E}\bigl[ \ell(s, \widehat{s}_{\widehat{m}}) \bigr] \le (1 + \epsilon_n) \, \mathbb{E}\Bigl[ \inf_{m \in \mathcal{M}_n} \{ \ell(s, \widehat{s}_m) \} \Bigr] . \tag{9} $$

The constant $K_1$ may depend on constants in (Ab), (An), (Ap), ($\mathrm{Ar}_\ell^X$) and
(P1–3) but not on $n$. The term $\epsilon_n$ is smaller than a negative power of $\ln(n)$; $\epsilon_n$ can also be
made smaller than $n^{-\delta}$ for any $0 < \delta < \delta_0(\beta_1, \beta_2)$ at the price of enlarging $K_1$.

Theorem 1 is proved in Section 8.3. The non-asymptotic oracle inequality
(8) implies that Procedure 1 is a.s. asymptotically optimal in this framework if
$\lim_{n \to \infty} (C / C_W) = 1$. When $W$ are Efr weights, the asymptotic optimality of RP
was proved by Shibata [62] for selecting among maximum likelihood estimators,
assuming that the distribution $P$ belongs to some parametric family of densities
(see also Remark 6 in Section 3.4.1).

Resampling Penalization yields an estimator with an excess loss as small
as the one of the oracle without requiring any knowledge about $P$ such as
the smoothness of $s$ or the variations of the noise level $\sigma$. Therefore, RP is a
naturally adaptive procedure. Note that (8) is even stronger than an adaptivity
result because of the leading constant close to one, whereas adaptive estimators
only achieve the correct estimation rate up to a possibly large absolute constant.
Hence, one can expect that an estimator obtained with RP and a well chosen
collection of models is almost optimal.

We now comment on the assumptions of Theorem 1:
1. The constant $C$ can differ from $C_W$. For instance, when a constant $\eta > 1$
exists such that $C \in [C_W; \eta C_W]$, the oracle inequalities (8) and (9) hold
with leading constant $2\eta - 1 + \epsilon_n$ instead of $1 + \epsilon_n$.

2. (Ab) and (An) are rather mild and neither $A$ nor $\sigma_{\min}$ need to be known
by the statistician. In particular, quite general heteroscedastic noises are
allowed; (Ab) and (An) can even be relaxed as explained in Section 3.3.2.

3. When $X$ has a lower bounded density with respect to Leb, ($\mathrm{Ar}_\ell^X$) is
satisfied for "almost piecewise regular" histograms, including all those
considered in the simulation study of Section 5.

4. The upper bound in (Ap) holds with $\beta_2 = 2 \alpha k^{-1}$ when $(I_\lambda)_{\lambda \in \Lambda_m}$ is regular
on $\mathcal{X} \subset \mathbb{R}^k$ and $s$ is $\alpha$-Hölderian with $\alpha > 0$. The lower bound in (Ap) is
discussed extensively in Section 3.3.1.
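
The exponent appearing in the last comment can be checked by hand in the simplest situation. The following sketch assumes that the feature space is the unit cube, that each bin is a cube of side 1/T (so that the dimension of the model is the k-th power of T), and that the Hölder condition is stated with the sup-norm and radius R, as in Theorem 2:

```latex
% Assumptions: I_\lambda is a cube of side 1/T in [0,1]^k, D_m = T^k,
% and |s(x) - s(x')| \le R \|x - x'\|_\infty^\alpha for all x, x'.
\begin{align*}
\ell(s, s_m)
  &= \sum_{\lambda \in \Lambda_m}
     \mathbb{E}\bigl[ (s(X) - \beta_\lambda)^2 \mathbf{1}_{X \in I_\lambda} \bigr] \\
  &\le \sum_{\lambda \in \Lambda_m} p_\lambda
     \sup_{x, x' \in I_\lambda} \bigl( s(x) - s(x') \bigr)^2 \\
  &\le R^2 T^{-2\alpha} = R^2 D_m^{-2\alpha / k} .
\end{align*}
```

The first inequality holds because the bin value is an average of values of the regression function over the bin, and the second because two points of the same bin are at sup-norm distance at most 1/T.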



3.2. An adaptive estimator 

A natural framework in which Theorem 1 can be applied is when $\mathcal{X}$ is a compact 
subset of $\mathbb{R}^k$, X has a lower bounded density with respect to the Lebesgue 
measure and s is $\alpha$-Hölderian with $\alpha \in (0, 1]$. Indeed, the latter condition ensures 
that regular histograms can approximate s well. In this subsection, we show 
that Resampling Penalization can be used to build an estimator adaptive to the 
smoothness of s in this framework. 

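We first define the estimator. For the sake of simplicity (*), $\mathcal{X}$ is assumed to be 
a closed ball of $(\mathbb{R}^k, \|\cdot\|_\infty)$, say $[0, 1]^k$. 

Procedure 2 (Resampling Penalization with regular histograms). For every 
$T \in \mathbb{N} \setminus \{0\}$, let $S_{m(T)}$ be the model of regular (**) histograms with $T^k$ bins, that 
is the histogram model associated with the partition 

$$(I_\lambda)_{\lambda \in \Lambda_{m(T)}} := \left( \prod_{i=1}^{k} \left[ \frac{j_i}{T} ; \frac{j_i + 1}{T} \right) \right)_{0 \le j_1, \ldots, j_k \le T-1}.$$

Then, define $(S_m)_{m \in \mathcal{M}_n} := (S_{m(T)})_{1 \le T \le n^{1/k}}$. 

0. Replace $\mathcal{M}_n$ by 

$$\widehat{\mathcal{M}}_n = \left\{ m \in \mathcal{M}_n \text{ s.t. } \min_{\lambda \in \Lambda_m} \{ n \hat{p}_\lambda \} \ge 3 \right\}.$$

1. Choose a resampling scheme $\mathcal{L}(W)$ among Efr, Rad, Poi, Rho and Loo. 

2. Take the constant $C = C_W$ as defined in Table 1. 

3. For each $m \in \widehat{\mathcal{M}}_n$, compute the resampling penalty pen(m) defined by (7). 

4. Select $\hat{m} \in \arg\min_{m \in \widehat{\mathcal{M}}_n} \{ P_n \gamma(\hat{s}_m) + \operatorname{pen}(m) \}$. 

5. Define $\tilde{s} := \hat{s}_{\hat{m}}$. 

(*) If $\mathcal{X}$ has a smooth boundary, Procedure 2 can be modified so that the proof of Theorem 2 
remains valid. 

(**) When $\mathcal{X}$ has a general shape, assume that both $\mathrm{Leb}(\mathcal{X})$ and $\mathrm{diam}(\mathcal{X})$ for $\|\cdot\|_\infty$ are finite. 
Then, a partition $(I_\lambda)_{\lambda \in \Lambda_m}$ is regular with $T^k$ bins when $\mathrm{Card}(\Lambda_m) = T^k$ and there 
exist positive constants $c_1, c_2, c_3, c_4$ such that for every $\lambda \in \Lambda_m$, $c_1 T^{-k} \le \mathrm{Leb}(I_\lambda) \le c_2 T^{-k}$ 
and $c_3 T^{-1} \le \mathrm{diam}(I_\lambda) \le c_4 T^{-1}$. 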
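The steps of Procedure 2 can be sketched in a few lines of code. The following is a minimal illustration, not the author's implementation: it assumes k = 1 and $\mathcal{X} = [0, 1]$, uses the Loo scheme with $C = C_W = n - 1$, and assumes that the penalty (7) has the form $C \, E_W[(P_n - P_n^W)\gamma(\hat{s}_m^W)]$ with the least-squares contrast. The function names (`bin_index`, `loo_penalty`, `empirical_risk`, `select_n_bins`) are hypothetical. For Loo weights the expectation over W is an average over the n possible left-out points, so it is computed exactly.

```python
from collections import Counter


def bin_index(x, n_bins):
    """Index j of the regular bin [j/T, (j+1)/T) of [0, 1] containing x (T = n_bins)."""
    return min(int(x * n_bins), n_bins - 1)


def loo_penalty(xs, ys, n_bins):
    """Leave-one-out resampling penalty with C = n - 1 (assumed form of (7)).

    Every non-empty bin must contain at least 2 points.
    """
    n = len(xs)
    bins = [bin_index(x, n_bins) for x in xs]
    counts = Counter(bins)
    sums = Counter()
    for b, y in zip(bins, ys):
        sums[b] += y
    total = 0.0
    for j in range(n):  # j is the left-out point (weight 0, others n/(n-1))
        errs = []
        for i in range(n):
            b = bins[i]
            if b == bins[j]:  # bin mean recomputed without point j
                fit = (sums[b] - ys[j]) / (counts[b] - 1)
            else:
                fit = sums[b] / counts[b]
            errs.append((fit - ys[i]) ** 2)
        full_risk = sum(errs) / n                          # P_n applied to the contrast
        resampled_risk = (sum(errs) - errs[j]) / (n - 1)   # P_n^W applied to the contrast
        total += full_risk - resampled_risk
    return (n - 1) * total / n


def empirical_risk(xs, ys, n_bins):
    """Empirical least-squares risk of the histogram estimator with n_bins bins."""
    bins = [bin_index(x, n_bins) for x in xs]
    counts = Counter(bins)
    sums = Counter()
    for b, y in zip(bins, ys):
        sums[b] += y
    return sum((y - sums[b] / counts[b]) ** 2 for b, y in zip(bins, ys)) / len(xs)


def select_n_bins(xs, ys, min_count=3):
    """Steps 0, 3 and 4: drop models with a bin holding fewer than min_count points,
    then minimize empirical risk + penalty over the remaining values of T."""
    n = len(xs)
    best_t, best_crit = None, float("inf")
    for t in range(1, n + 1):
        counts = Counter(bin_index(x, t) for x in xs)
        if len(counts) < t or min(counts.values()) < min_count:
            continue
        crit = empirical_risk(xs, ys, t) + loo_penalty(xs, ys, t)
        if crit < best_crit:
            best_t, best_crit = t, crit
    return best_t
```

With a single bin and responses (0, 0, 3), `loo_penalty` returns 2, which equals twice the unbiased variance estimate divided by n; this matches the order of magnitude $2\sigma^2 D_m / n$ expected for the ideal penalty in the homoscedastic case.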

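Theorem 2. Let $\mathcal{X} = [0,1]^k$. Assume that the data $(X_i, Y_i)_{1 \le i \le n}$ satisfy the 
following: 

(Ab) Bounded data: $\|Y_i\|_\infty \le A < \infty$. 

(An) Noise-level bounded from below: $\sigma(X_i) \ge \sigma_{\min} > 0$ a.s. 

(Ad$_\ell$) Density bounded from below: 
$\exists c_X^{\min} > 0$, $\forall I \subset \mathcal{X}$, $P(X \in I) \ge c_X^{\min} \mathrm{Leb}(I)$. 

(Ah) Hölderian regression function: there exist $\alpha \in (0; 1]$ and $R > 0$ such that 
$s \in \mathcal{H}(\alpha, R)$, that is $\forall x_1, x_2 \in \mathcal{X}$, $|s(x_1) - s(x_2)| \le R \|x_1 - x_2\|_\infty^{\alpha}$. 

Let $\tilde{s}$ be the estimator defined by Procedure 2 and $\sigma_{\max} := \sup_{\mathcal{X}} |\sigma| \le 2A$. 
Then, there exist positive constants $K_2$ and $K_3$ such that 

$$E[\ell(s, \tilde{s})] \le K_2 R^{\frac{2k}{2\alpha+k}} n^{\frac{-2\alpha}{2\alpha+k}} \sigma_{\max}^{\frac{4\alpha}{2\alpha+k}} + K_3 A^2 n^{-2}. \quad (10)$$

If moreover the noise-level is smooth, that is 
(A$\sigma$) $\sigma$ is piecewise $K_\sigma$-Lipschitz with at most $J_\sigma$ jumps, 

then, assumption (An) can be removed and (10) holds with $\sigma_{\max}$ replaced by 
$\|\sigma\|_{L^2(\mathrm{Leb})} := \left[ (\mathrm{Leb}(\mathcal{X}))^{-1} \int_{\mathcal{X}} \sigma^2(t) \, dt \right]^{1/2}$. 

For both results, $K_2$ may only depend on $\alpha$ and k. The constant $K_3$ may only 
depend on k, A, $c_X^{\min}$, R, $\alpha$ (and $\sigma_{\min}$ for (10); $K_\sigma$ and $J_\sigma$ for the latter result). 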

Theorem 2 is proved in Section 8.5. The upper bounds given by Theorem 2 
coincide with several classical minimax lower bounds on the estimation of 
functions in $\mathcal{H}(\alpha, R)$ with $\alpha \in (0, 1]$, up to an absolute constant. In the homoscedastic 
case, lower bounds have been proved by Stone [03] and generalized by several 
authors among which Korostelev and Tsybakov [47] and Yang and Barron [69]. 
Up to a multiplicative factor independent of n, R and $\sigma$, the best achievable rate 
is 

$$R^{\frac{2k}{2\alpha+k}} n^{\frac{-2\alpha}{2\alpha+k}} \sigma^{\frac{4\alpha}{2\alpha+k}}.$$

Hence, (10) shows that Procedure 2 achieves the right estimation rate in terms 
of n, R and $\sigma$, without using the knowledge of $\alpha$, R or $\sigma$. 
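To get a feeling for how this rate depends on the sample size, the noise level and the dimension, it can be evaluated numerically. The sketch below assumes the rate reads $R^{2k/(2\alpha+k)} n^{-2\alpha/(2\alpha+k)} \sigma^{4\alpha/(2\alpha+k)}$; the function name `minimax_rate` is hypothetical and multiplicative constants are ignored.

```python
def minimax_rate(n, radius, sigma, alpha, k):
    """Rate R^(2k/(2a+k)) * n^(-2a/(2a+k)) * sigma^(4a/(2a+k)), constants omitted."""
    d = 2 * alpha + k
    return radius ** (2 * k / d) * n ** (-2 * alpha / d) * sigma ** (4 * alpha / d)


# With alpha = k = 1 the rate is n^(-2/3): multiplying n by 8 divides it by 4.
ratio = minimax_rate(8000, 1.0, 1.0, 1.0, 1) / minimax_rate(1000, 1.0, 1.0, 1.0, 1)
```

For fixed $\alpha$, increasing k makes the exponent of n closer to zero, which is the usual curse of dimensionality for Hölderian classes.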

Moreover, (10) still holds in a wide heteroscedastic framework, without using 
any information on the noise-level $\sigma(\cdot)$. Then, up to a multiplicative constant 
independent of n and R (but possibly of the order of some power of $\sigma_{\max}/\sigma_{\min}$), 
the upper bound (10) is the best possible estimation rate. 

Minimax lower bounds proved in the heteroscedastic case (see for instance 
[28, 35] and references therein) show that when $k = \alpha = 1$ and the noise-level is 
smooth enough, the best achievable estimation rate depends on $\sigma$ through the 
multiplicative factor $\|\sigma\|_{L^2(\mathrm{Leb})}^{4/3}$. Therefore, the upper bound given by Theorem 2 
under assumption (A$\sigma$) is tight, even through its dependence on the noise-level. 
To the best of our knowledge, such an upper bound had never been obtained when 
$\alpha \in (0, 1)$ and $k > 1$, even with estimators using the knowledge of $\alpha$, $\sigma$ and R. 

Theorem 2 shows that Procedure 2 defines an adaptive estimator, uniformly 
over distributions such that s belongs to some Hölderian ball $\mathcal{H}(\alpha, R)$ with 
$\alpha \in (0, 1]$ and the noise-level $\sigma$ is not too pathological. This result is quite strong. 
Although similar properties have already been proved for "ad hoc" estimators 
(see [28, 35] and Section 7.1.3), Resampling Penalization has not been designed 
specifically to have such a property. Therefore, exchangeable resampling 
penalties are naturally adaptive to the smoothness of s and to the heteroscedasticity 
of the data. 

Remark 3. 

1. The proof of Theorem 2 shows that $\tilde{s}$ achieves the minimax rate of 
estimation on an event of probability larger than $1 - K_3' n^{-2}$. In particular, 
with probability one. 



2. If s is piecewise $\alpha$-Hölderian with at most $J_s$ jumps (each jump of height 
bounded by 2A), then (10) holds with $K_3$ depending also on $J_s$. 

3. As for Theorem 1, the boundedness of the data and the lower bound on 
the noise level can be replaced by other assumptions (see Section 3.3.2). 

3.3. Discussion on some assumptions 

The aim of this subsection is to discuss some of the main assumptions made in 
Theorems 1 and 2. We first tackle the lower bound in (Ap) which is required 
in Theorem 1. Then, two alternative assumption sets to Theorems 1 and 2 are 
provided, allowing the noise level to vanish or the data to be unbounded. 

3.3.1. Lower bound in (Ap) 

The lower bound $\ell(s, s_m) \ge C_b^- D_m^{-\beta_1}$ in (Ap) may seem unintuitive because 
it means that s is not too well approximated by the models $S_m$. Assuming 
that $\inf_{m \in \mathcal{M}_n} \ell(s, s_m) > 0$ is classical for proving the asymptotic optimality of 
Mallows' $C_p$ [61, 49, 21]. 

Let us explain why (Ap) is used for proving Theorem 1. According to 
Remark 8 in Section 8.2, when the lower bound in (Ap) is no longer assumed, (8) 
holds with two modifications on its right-hand side: the infimum is restricted to 
models of dimension larger than $(\ln(n))^{\gamma_1}$ and a remainder term $(\ln(n))^{\gamma_2} n^{-1}$ 
is added (where $\gamma_1$ and $\gamma_2$ are absolute constants). This is essentially the same 
as (8) unless there exists a model of small dimension with a small bias; the lower 
bound in (Ap) is sufficient to ensure this does not happen. Note that 
assumption (Ap) was made in the density estimation framework [64, 23] for the same 
technical reasons. 
As shown in [n], (Ap) is at least satisfied with 

$$\beta_1 = k^{-1} + \alpha^{-1} - (k-1) k^{-1} \alpha^{-1} \quad \text{and} \quad \beta_2 = 2\alpha k^{-1}$$

in the following case: $(I_\lambda)_{\lambda \in \Lambda_m}$ is "regular" (as defined in Procedure 2), 
X has a lower-bounded density with respect to the Lebesgue measure Leb on 
$\mathcal{X} \subset \mathbb{R}^k$ and s is non-constant and $\alpha$-Hölderian (with respect to $\|\cdot\|_\infty$). 

The general formulation of (Ap) is crucial to make Theorem 1 valid whatever 
the distribution of X, which can be useful in some practical problems. Indeed, 
when X has a general distribution, a collection $(S_m)_{m \in \mathcal{M}_n}$ satisfying (P1), 
(P2), (Ar$_\ell^X$) and (Ap) can always be chosen either thanks to prior knowledge 
on $\mathcal{L}(X)$ or to unlabeled data. In the latter case, classical density estimation 
procedures can be applied for estimating $\mathcal{L}(X)$ from unlabeled data (see for 
instance [2(i] on density estimation). Assumption (Ap) then means that the 
collection of models has good approximation properties, uniformly over some 
appropriate function space (depending on $\mathcal{L}(X)$) to which s belongs. 

3.3.2. Two alternative assumption sets 

Theorems 1 and 2 are corollaries of a more general result, called Lemma 7 in 
Section 8.2. The assumptions of Theorems 1 and 2, in particular (Ab) and 
(An) on the distribution of the noise $\sigma(X)\varepsilon$, are only sufficient conditions for 
the assumptions of Lemma 7 to hold. The following two alternative sufficient 
conditions are proved to be valid in Section 8.4. 

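First, one can have $\sigma_{\min} = 0$ in (An) if moreover $E[\sigma(X)^2] > 0$, $\mathcal{X} \subset \mathbb{R}^k$ is 
bounded and 

(Ar$_u^d$) Upper regularity of the partitions for $\|\cdot\|_\infty$: $\exists c_{r,u}^d, \alpha_d > 0$ such that 
$\forall m \in \mathcal{M}_n$, $\max_{\lambda \in \Lambda_m} \{ \mathrm{diam}(I_\lambda) \} \le c_{r,u}^d D_m^{-\alpha_d}$. 

(Ar$_u$) Upper regularity of the partitions for Leb: $\exists c_{r,u} > 0$ such that 
$\forall m \in \mathcal{M}_n$, $\max_{\lambda \in \Lambda_m} \{ \mathrm{Leb}(I_\lambda) \} \le c_{r,u} D_m^{-1}$. 

(A$\sigma$) $\sigma$ is piecewise $K_\sigma$-Lipschitz with at most $J_\sigma$ jumps. 

Second, the $Y_i$ can be unbounded (assuming now that $\sigma_{\min} > 0$ in (An)) if 
moreover $\mathcal{X} \subset \mathbb{R}$ is bounded measurable and 

The noise is sub-Gaussian: $\exists c_{\mathrm{gauss}} > 0$ such that 
$\forall q \ge 2$, $\forall x \in \mathcal{X}$, $E[ |\varepsilon|^q \mid X = x ]^{1/q} \le c_{\mathrm{gauss}} \sqrt{q}$. 

Noise-level bounded from above: $\sigma^2(X) \le \sigma_{\max}^2 < +\infty$ a.s. 

Bound on the regression function: $\|s\|_\infty \le A$. 

s is B-Lipschitz, piecewise $C^1$ and non-constant: $\pm s' \ge B_0 > 0$ on 
some interval $J \subset \mathcal{X}$ with $\mathrm{Leb}(J) \ge c_J > 0$. 

(Ar$_{\ell,u}$) Regularity of the partitions for Leb: $\exists c_{r,\ell}, c_{r,u} > 0$ such that 
$\forall m \in \mathcal{M}_n$, $\forall \lambda \in \Lambda_m$, $c_{r,\ell} D_m^{-1} \le \mathrm{Leb}(I_\lambda) \le c_{r,u} D_m^{-1}$. 

(Ad$_\ell$) Density bounded from below: $\exists c_X^{\min} > 0$, $\forall I \subset \mathcal{X}$, $P(X \in I) \ge c_X^{\min} \mathrm{Leb}(I)$. 

Third, it is possible to have simultaneously $\sigma_{\min} = 0$ in (An) and unbounded 
data, see [i<] for details. 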

The above results mean that Theorem 1 holds for most "reasonably" difficult 
problems. Actually, Proposition 3 and Remark 7 show that the resampling 
penalties are much closer to $E[\operatorname{pen}_{\mathrm{id}}(m)]$ than $\operatorname{pen}_{\mathrm{id}}(m)$ itself, provided that 
the concentration inequalities for $\operatorname{pen}_{\mathrm{id}}$ are tight (Proposition 10). Therefore, 
up to differences within $\varepsilon_n$, RP with $C = C_W$ and the "ideal" deterministic 
penalization procedure $E[\operatorname{pen}_{\mathrm{id}}(m)]$ perform equally well on a set of probability 
$1 - K_1 n^{-2}$. For every assumption set such that the proof of Theorem 1 gives 
an oracle inequality for the penalty $E[\operatorname{pen}_{\mathrm{id}}(m)]$, the same proof gives a similar 
oracle inequality for RP. 

3.4. Probabilistic tools 

Theorems 1 and 2 rely on several probabilistic tools of independent interest: 
precise computation of the expectations of resampling penalties (Propositions 1 
and 2), concentration inequalities for resampling penalties (Proposition 3) and 
bounds on expectations of the inverses of several classical random variables 
(Lemmas 4-6). Their originality comes from their non-asymptotic nature: explicit 
bounds on the deviations or the remainder terms are provided for finite sample 
sizes. 

3.4.1. Expectations of resampling penalties 

Using only the exchangeability of the weights, the resampling penalty can be 
computed explicitly (Lemma 16 in Section 8.8). This can be used to compare 
the expectations of the resampling penalties and the ideal penalty. First, Propo- 
sition 1 is valid for general exchangeable weights. 

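Proposition 1. Let $S_m$ be the model of histograms associated with some 
partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$ and $W \in [0, \infty)^n$ be an exchangeable random vector 
independent of the data. Define $\operatorname{pen}_{\mathrm{id}}(m)$ by (3) and pen(m) by (7). Let $E^{\Lambda_m}[\cdot]$ denote 
expectations conditionally on $(\mathbf{1}_{X_i \in I_\lambda})_{1 \le i \le n, \, \lambda \in \Lambda_m}$. Then, if $\min_{\lambda \in \Lambda_m} \hat{p}_\lambda > 0$, 

$$E^{\Lambda_m}[\operatorname{pen}_{\mathrm{id}}(m)] = \frac{1}{n} \sum_{\lambda \in \Lambda_m} \left( 1 + \frac{p_\lambda}{\hat{p}_\lambda} \right) \sigma_\lambda^2 \quad (11)$$

$$E^{\Lambda_m}[\operatorname{pen}(m)] = \frac{C}{n} \sum_{\lambda \in \Lambda_m} \left( R_{1,W}(n, \hat{p}_\lambda) + R_{2,W}(n, \hat{p}_\lambda) \right) \sigma_\lambda^2 \quad (12)$$

with 

$$R_{1,W}(n, \hat{p}_\lambda) := E\left[ \frac{(W_1 - \widehat{W}_\lambda)^2}{\widehat{W}_\lambda^2} \,\Big|\, X_1 \in I_\lambda, \widehat{W}_\lambda > 0 \right] \quad (13)$$

$$R_{2,W}(n, \hat{p}_\lambda) := E\left[ \frac{(W_1 - \widehat{W}_\lambda)^2}{\widehat{W}_\lambda} \,\Big|\, X_1 \in I_\lambda \right]. \quad (14)$$

In particular, 

$$E[\operatorname{pen}_{\mathrm{id}}(m)] = \frac{1}{n} \sum_{\lambda \in \Lambda_m} \left( 2 + \delta_{n, p_\lambda} \right) \sigma_\lambda^2 \quad (15)$$

where $\delta_{n,p}$ only depends on (n, p) and satisfies $|\delta_{n,p}| \le L_1 (np)^{-1/4}$ for some 
absolute constant $L_1$. 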

Proposition 1 is proved in Section 8.8. 

Remark 4. • In order to make the expectation in (15) well-defined, a 
convention for $\operatorname{pen}_{\mathrm{id}}(m)$ has to be chosen when $\min_{\lambda \in \Lambda_m} \hat{p}_\lambda = 0$. See Section 8.1 
for details. 

• Combining Proposition 1 with [(>, Lemma 8.4], a similar result holds for 
non-exchangeable weights (with only a modification of the definitions of 
$R_{1,W}$ and $R_{2,W}$). 

In the general heteroscedastic framework (1), Proposition 1 shows that 
resampling penalties take into account the fact that $\sigma_\lambda$ actually depends on $\lambda \in \Lambda_m$. 
This is a major difference with the classical Mallows' $C_p$ penalty 

$$\operatorname{pen}_{\mathrm{Mallows}}(m) := \frac{2 E[\sigma^2(X)] D_m}{n}$$

which does not take into account the variability of the noise level over $\mathcal{X}$. A 
more detailed comparison with Mallows' $C_p$ is made in Section 7.1.1. 

If $R_{1,W}(n, \hat{p}_\lambda) + R_{2,W}(n, \hat{p}_\lambda)$ does not depend too much on $\hat{p}_\lambda$ (at least when 
$n \hat{p}_\lambda$ is large), Proposition 1 shows that pen(m) estimates unbiasedly $\operatorname{pen}_{\mathrm{id}}(m)$ 
as soon as (*) 

$$C = C_W := \frac{2}{R_{1,W}(n, 1) + R_{2,W}(n, 1)} = \frac{1}{E[(W_1 - 1)^2]}.$$

In particular, all the examples of resampling weights given in Section 2.2 satisfy 
that $R_{1,W}(n, \hat{p}_\lambda) \approx R_{2,W}(n, \hat{p}_\lambda)$ does not depend on $\hat{p}_\lambda$ when $n \hat{p}_\lambda$ is large, which 
leads to Proposition 2 below (see Table 2 for exact expressions of $R_{2,W}$ and $C_W$). 

(*) The definition of $C_W$ actually used in this paper is slightly different for Efron(M) and 
Poisson($\mu$) weights (see Table 2). We arbitrarily chose the simplest possible expression 
making $C_W$ asymptotically equivalent to $1/E[(W_1 - 1)^2]$. The results of the paper also hold 
when $C_W = 1/E[(W_1 - 1)^2]$. 

Proposition 2. Let W be an exchangeable resampling weight vector among 
Efr($M_n$), Rad(p), Poi($\mu$), Rho($\lfloor n/2 \rfloor$) and Loo, and define $C_W$ as in Table 2. 
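Table 2 

$R_{2,W}(n, \hat{p}_\lambda)$ and $C_W$ for several resampling schemes. The formulas for $R_{2,W}$ come from 
Lemma 17 

$\mathcal{L}(W)$ | Efr(M) | Rad(p) | Poi($\mu$) | Rho(q) | Loo 
$R_{2,W}(n, \hat{p}_\lambda)$ | $\frac{n}{M}\left(1 - \frac{1}{n \hat{p}_\lambda}\right)$ | $\frac{1}{p} - 1$ | $\frac{1}{\mu}\left(1 - \frac{1}{n \hat{p}_\lambda}\right)$ | $\frac{n}{q} - 1$ | $\frac{1}{n-1}$ 
$C_W$ | $M/n$ | $p/(1-p)$ | $\mu$ | $q/(n-q)$ | $n - 1$ 
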

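Let $S_m$ be the model of histograms associated with some partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$ 
and pen(m) be defined by (7). Then, there exist real numbers $\delta^{(\mathrm{pen}W)}_{n, \hat{p}_\lambda}$ depending 
only on n, $\hat{p}_\lambda$ and the resampling scheme $\mathcal{L}(W)$ such that 

$$E^{\Lambda_m}[\operatorname{pen}(m)] = \frac{C}{C_W \, n} \sum_{\lambda \in \Lambda_m} \left( 2 + \delta^{(\mathrm{pen}W)}_{n, \hat{p}_\lambda} \right) \sigma_\lambda^2. \quad (16)$$

If $M n^{-1} \ge B > 0$ (Efr), $p \in (0; 1)$ (Rad) or $\mu > 0$ (Poi), then, 

$$\forall n \in \mathbb{N} \setminus \{0\}, \ \forall \hat{p}_\lambda \in (0, 1], \quad \left| \delta^{(\mathrm{pen}W)}_{n, \hat{p}_\lambda} \right| \le L_2 (n \hat{p}_\lambda)^{-1/4},$$

where $L_2 > 0$ is an absolute constant (for Rho($\lfloor n/2 \rfloor$) and Loo) or depends 
respectively on B (Efr), p (Rad) or $\mu$ (Poi). More precise bounds for each 
weight distribution are given by (62)-(66) in Section 8.9. 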

Proposition 2 is proved in Section 8.9. 
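The relation between $C_W$ and $1/E[(W_1 - 1)^2]$ can be checked numerically. The sketch below encodes the $C_W$ row of Table 2 and the second moment of $W_1 - 1$ under assumed definitions of the weights (Section 2.2 is not reproduced here): Efr(M) as n/M times a multinomial vector with M draws, Rad(p) as i.i.d. Bernoulli(p)/p, Poi($\mu$) as i.i.d. Poisson($\mu$)/$\mu$, Rho(q) as n/q times the indicator of a uniform random subset of size q, and Loo as Rho(n - 1). The function names are hypothetical.

```python
def c_w(scheme, n, param=None):
    """Constant C_W as listed in Table 2 (param is M, p, mu or q)."""
    if scheme == "Efr":
        return param / n
    if scheme == "Rad":
        return param / (1 - param)
    if scheme == "Poi":
        return param
    if scheme == "Rho":
        return param / (n - param)
    if scheme == "Loo":
        return n - 1
    raise ValueError(f"unknown scheme: {scheme}")


def second_moment(scheme, n, param=None):
    """E[(W_1 - 1)^2] under the weight definitions assumed above."""
    if scheme == "Efr":   # W_1 = (n/M) * Binomial(M, 1/n)
        return (n / param) * (1 - 1 / n)
    if scheme == "Rad":   # W_1 = Bernoulli(p) / p
        return (1 - param) / param
    if scheme == "Poi":   # W_1 = Poisson(mu) / mu
        return 1 / param
    if scheme == "Rho":   # W_1 = n/q with probability q/n, else 0
        return n / param - 1
    if scheme == "Loo":   # Rho with q = n - 1
        return 1 / (n - 1)
    raise ValueError(f"unknown scheme: {scheme}")


n = 200
products = {
    name: c_w(name, n, par) * second_moment(name, n, par)
    for name, par in [("Efr", 200), ("Rad", 0.5), ("Poi", 2.0), ("Rho", 100), ("Loo", None)]
}
```

Under these assumptions the product equals 1 for Rad, Poi, Rho and Loo, and 1 - 1/n for Efr, which is consistent with the statement that the Table 2 value for Efron weights is only asymptotically equivalent to $1/E[(W_1 - 1)^2]$.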
Remark 5. Proposition 2 can also be generalized to Rho($q_n$) weights with 
$0 < B_- \le q_n n^{-1} \le B_+ < 1$, but the bound on $\delta^{(\mathrm{pen}W)}_{n, \hat{p}_\lambda}$ only holds for 
$n \hat{p}_\lambda \ge L(B_-, B_+)$ and $L_2$ depends on $B_-, B_+$ (see Section 8.9). 

Remark 6. Combined with the explicit expressions of $C_W$ for several resampling 
weights (Table 2), Proposition 2 helps to understand several known results. 

• In the maximum likelihood framework, Shibata [62] showed the asymptotic 
equivalence of two bootstrap penalization methods. The first penalty, 
denoted by $B_1$, is Efron's bootstrap penalty [30], which is defined by (5) 
with Efr weights. The second penalty, denoted by $B_2$, was proposed by 
Cavanaugh and Shumway [25]; it transposes 

$$2 \hat{p}_1(m) = 2 E_W\left[ P_n \left( \gamma(\hat{s}_m^W) - \gamma(\hat{s}_m) \right) \right]$$

into the maximum likelihood framework. In the least-squares regression 
framework (with histogram models), the proofs of Propositions 1 and 2 
show that 

$$E^{\Lambda_m}[2 \hat{p}_1(m)] = \frac{2}{n} \sum_{\lambda \in \Lambda_m} R_{1,W}(n, \hat{p}_\lambda) \sigma_\lambda^2 \approx E^{\Lambda_m}[\operatorname{pen}(m)]$$

for several resampling schemes, including Efron's bootstrap (for which 
$C_W = 1$). The concentration results of Section 8.10 show that this remains 
true without expectations. 

• With Efron($M_n$) weights (and a bootstrap selection procedure close to 
RP, but with C = 1), Shao [59] showed that $M_n = n$ leads to an 
inconsistent model selection procedure for identification. On the contrary, when 
$M_n \to \infty$ and $M_n \ll n$, Shao's bootstrap selection procedure is model 
consistent. Proposition 2 shows that these assumptions on $M_n$ can be rewritten 
$C = 1 \gg C_W = M_n / n$. Therefore, the rationale behind Shao's result may 
mostly be that identification needs overpenalization within a factor tending 
to infinity with n. 



3.4.2. Concentration inequalities for resampling penalties 
From (4), the ideal penalty can be written 

Hence, $\operatorname{pen}_{\mathrm{id}}(m)$ is a U-statistic of order 2 conditionally on $(\mathbf{1}_{X_i \in I_\lambda})_{1 \le i \le n, \, \lambda \in \Lambda_m}$, 
which is sufficient to prove that resampling yields a consistent estimate of 
$\operatorname{pen}_{\mathrm{id}}(m)$ (Arcones and Gine [5] considered the bootstrap case; Huskova and 
Janssen [42] extended it to the exchangeable weighted bootstrap). 

In the non-asymptotic framework, that is when the models Sm can depend 
on n, the following concentration inequality is needed. 

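Proposition 3. Let $\gamma > 0$, $A_n \ge 2$ and W be an exchangeable weight vector. 
Let $S_m$ be the model of histograms associated with some partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of 
$\mathcal{X}$ and pen(m) be defined by (7). Assume that two positive constants $a_\ell$ and 
$\xi_\ell$ exist such that for every $q \ge 2$, 

$$\forall \lambda \in \Lambda_m, \quad \frac{m_{q,\lambda}}{m_{2,\lambda}} \le a_\ell q^{\xi_\ell} \quad \text{where} \quad m_{q,\lambda} := \left( E\left[ |Y - s_m(X)|^q \mid X \in I_\lambda \right] \right)^{1/q}.$$

Let $\Omega_m(A_n)$ denote the event $\{ \min_{\lambda \in \Lambda_m} \{ n \hat{p}_\lambda \} \ge A_n \}$. Then, there exist 
constants $K_4, K_5 > 0$ and an event of probability at least $1 - K_4 n^{-\gamma}$ on which 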

|pen(TO) -E^" [pcn(m)]| ln,„(A„) < CKr, 

(ln(n) 

X sup {Riw{n,p) + R2,win,p)} — E[p2{m)] 

where $R_{1,W}$ and $R_{2,W}$ are defined by (13) and (14). The constant $K_4$ is absolute 
and $K_5$ may only depend on $a_\ell$, $\xi_\ell$ and $\gamma$. 

If moreover W satisfies the assumptions of the second part of Proposition 2 
and $C_W$ is defined as in Table 2, then a constant $K_W > 0$ exists such that 

I f \ TOA™r t Mil CK5KwiHn))^'^\ , , f.rj-. 

|pen(m) -E "[pen(m)]| ln^(A„) < „ E[p2(m)] . (17) 

For the Rad(p) weights, $K_W$ is smaller than $(1 - p)^{-1}$ multiplied by an absolute 
constant. For the other weights, $K_W$ is an absolute constant. 
Proposition 3 is proved in Section 8.10.1. Note that the moment condition 
holds under the assumptions of Theorem 1 as well as the alternative assumptions 
of Section 3.3.2. It is here stated in its most general form. 

Remark 7. Since $A_n$ should tend to infinity with n for most reasonable 
models, the $A_n^{-1/2}$ factor makes Proposition 3 give better bounds for resampling 
penalties than what could be obtained for ideal penalties with Proposition 10 
in the same framework. 

Although we do not know how tight the bounds of Proposition 3 are, such a 
phenomenon is classical with bootstrap and can be understood from the 
asymptotic point of view through Edgeworth expansions [39]. In a non-asymptotic 
Gaussian framework, [10, Section 2.3] shows the same property for resampling 
estimators, which concentrate at the rate $N^{-1}$ instead of $N^{-1/2}$ (N being the 
amount of data). Since $A_n$ plays the role of N, the gain $A_n^{-1/2}$ can reasonably 
be conjectured to be unimprovable without some more assumptions. 

Let us emphasize that if resampling penalties estimate $E[\operatorname{pen}_{\mathrm{id}}(m)]$ instead 
of $\operatorname{pen}_{\mathrm{id}}(m)$, RP with $C = C_W$ cannot take into account the fact that $\operatorname{pen}_{\mathrm{id}}(m)$ 
may be far from its expectation. 

3.4.3. Expectations of inverses 

For any non-negative random variable Z, we define 

$$e^+_{\mathcal{L}(Z)} := E[Z] \, E[Z^{-1} \mid Z > 0].$$

This quantity appears in the explicit formulas for $R_{1,W}$ when W is among the 
examples of resampling weights of Section 2.2 (see Lemma 17). Therefore, in 
order to prove Proposition 2, non-asymptotic bounds on $e^+_{\mathcal{L}(Z)}$ are needed when Z 
has a binomial, hypergeometric or Poisson distribution. 

Former results concerning $e^+_{\mathcal{L}(Z)}$ can be found in papers by Lew [4n] (for general 
Z), by Jones and Zhigljavsky [44] (for the Poisson case) and by Znidaric [70] (for 
the binomial and Poisson case), but they are either asymptotic or not precise 
enough. Lemmas 4-6 solve this issue. 

In the rest of the paper, for any $a, b \in \mathbb{R}$, $a \wedge b$ denotes the minimum of a and 
b and $a \vee b$ denotes the maximum of a and b. 

Binomial case 

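Lemma 4. For any $n \in \mathbb{N} \setminus \{0\}$ and $p \in (0; 1]$, $\mathcal{B}(n, p)$ denotes the binomial 
distribution with parameters (n, p), $\kappa_1 := 5.1$ and $\kappa_2 := 3.2$. Then, if $np \ge 1$, 

$$\kappa_1 \wedge \left( 1 + \kappa_2 (np)^{-1/4} \right) \ge e^+_{\mathcal{B}(n,p)} \ge 1 - e^{-np} \quad (18)$$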
and 2 4-3 X 10"^ > e+^^^ 1 ^ > 1„>3. (19) 

The first bounds (18) were first stated in [9, Lemma 3] where they are proved. 
The second ones (19) are proved in Section 8.11.1. Lemma 4 implies in particular 
that $e^+_{\mathcal{B}(n,p)} \to 1$ when $np \to \infty$, which can be derived from [70]. 
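The quantity $e^+$ can be computed exactly for a binomial variable, which gives a quick sanity check of the first pair of bounds. The sketch below is only an illustration: it assumes that the upper bound in (18) reads $\kappa_1 \wedge (1 + \kappa_2 (np)^{-1/4})$ with $\kappa_1 = 5.1$ and $\kappa_2 = 3.2$, and the function names are hypothetical.

```python
from math import comb, exp


def e_plus_binomial(n, p):
    """E[Z] * E[1/Z | Z > 0] for Z ~ Binomial(n, p), computed by direct summation."""
    p_zero = (1 - p) ** n
    inv_sum = sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k) / k for k in range(1, n + 1)
    )
    return n * p * inv_sum / (1 - p_zero)


def lemma4_bounds(n, p, kappa1=5.1, kappa2=3.2):
    """Lower and upper bounds of (18) as read above; meaningful when n * p >= 1."""
    lower = 1 - exp(-n * p)
    upper = min(kappa1, 1 + kappa2 * (n * p) ** -0.25)
    return lower, upper


# e^+ for a few (n, p) pairs with np >= 1, next to the bounds
checks = [
    (n, p, e_plus_binomial(n, p), lemma4_bounds(n, p))
    for n in (3, 10, 50, 200)
    for p in (0.1, 0.5, 1.0)
    if n * p >= 1
]
```

The lower bound also follows from Jensen's inequality applied to $1/Z$ conditionally on $Z > 0$, which gives $e^+ \ge 1 - P(Z = 0) = 1 - (1-p)^n \ge 1 - e^{-np}$.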
Hypergeometric case 

Recall that a hypergeometric random variable $X \sim \mathcal{H}(n, r, q)$ is defined by 

$$\forall k \in \{0, \ldots, q \wedge r\}, \quad P(X = k) = \frac{\binom{r}{k} \binom{n-r}{q-k}}{\binom{n}{q}}.$$

Lemma 5. Let $n, r, q \in \mathbb{N}$ such that $n \ge r \ge 1$ and $n \ge q \ge 1$. 
1. General lower-bound: 



+ ( V 

^H(n,r,q) - ^ ^ lr<ri-g CXp — 



n 

2 



2. General upper-bound: let e G (0; 1) and usie) := 0.9 + 1.4 x e 

n , ^ 2r 
If r>2 and - < (1 - e)- 



q 2+ y/3{r-{-l)\n{r) 

Then, 1 + n,{e)^^l^ > e^^^^^^^^y (20) 



3. "Rho" case: tfn>2, 



14.3>sup{e+( and 3 > sup { e+(„^ j J . (21) 



4- "Loo" case: 

1 + -ZTZ TT ^ = 1 + - -7- Tt1''>2 - 1 > 1 - —■ (22) 



n(r-l)- nVw(?'-l) 
5. "Lpo " case: ifn>r>n — q+1 > 2, 

r-n + q ^ n{n - 1) ■ ■ ■ {q + 1) " ^^C".*-^-?) " 
Lemma 5 is proved in Section 8.11.2. It implies in particular that 
$e^+_{\mathcal{H}(n_k, r_k, q_k)} \to 1$ if $n_k \ge r_k \to +\infty$ 
and $\sup_k \{ n_k q_k^{-1} \} < +\infty$. 

Poisson case 

Lemma 6. For every $\mu > 0$, $\mathcal{P}(\mu)$ denotes the Poisson distribution with 
parameter $\mu$. Then, 



Lemma 6 is proved in Section 8.11.3. It implies in particular that $e^+_{\mathcal{P}(\mu)} \to 1$ 
when $\mu \to \infty$, which can be derived from [44, 70]. 
4. Comparison of the weights 

We investigate in this section how the loss of the final estimator may depend 
on the distribution of the exchangeable weight vector W. First, we consider 
in Section 4.1 the most classical ones, that is Efr, Rad, Poi, Rho and Loo. 
Then, we discuss in Section 4.2 whether Theorem 1 can be extended to general 
exchangeable weights. 

4.1. Comparison of the classical weights 

According to Theorem 1, any resampling scheme among Efr, Rad, Poi, Rho 
and Loo leads to an asymptotically optimal procedure. Even from the 
non-asymptotic point of view, it is not easy to distinguish between these 
weights with the results of Section 3. Indeed, the resampling penalties are equal 
in expectation at first order (Proposition 2), and their deviations are negligible 
compared with their expectations (Proposition 3). 

Therefore, differences between these weights can only come from second-order 
terms, either in the expectations or in the sizes of the deviations of resampling 
penalties. As a first step, we compare in this subsection second-order terms in the 
expectations of the penalties (that is, differences between second-order terms in 
(15) and (16)), for a fixed sample size. Asymptotic considerations can be found 
in the book by Barbe and Bertail [J Chapter 2] where Edgeworth expansions 
are used to compare the accuracy of estimation with many exchangeable weights. 
The asymptotic results mentioned in Section 3.4.3 may also be useful. 

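Propositions 1 and 2 show that $\operatorname{pen}_{\mathrm{id}}(m)$ and pen(m) have the same 
expectation, up to the small terms $\delta_{n, p_\lambda}$ and $\delta^{(\mathrm{pen}W)}_{n, \hat{p}_\lambda}$. More precisely, 

$$E[\operatorname{pen}(m) - \operatorname{pen}_{\mathrm{id}}(m)] = \frac{1}{n} \sum_{\lambda \in \Lambda_m} \left( \bar{\delta}^{(\mathrm{pen}W)}_{n, p_\lambda} - \delta_{n, p_\lambda} \right) (\sigma_\lambda)^2 \quad \text{with} \quad \bar{\delta}^{(\mathrm{pen}W)}_{n, p_\lambda} := E\left[ \delta^{(\mathrm{pen}W)}_{n, \hat{p}_\lambda} \,\Big|\, \hat{p}_\lambda > 0 \right].$$

Using the explicit expressions of $\delta_{n,p}$ and $\delta^{(\mathrm{pen}W)}_{n, \hat{p}_\lambda}$, $\delta_{n,p}$ and $\bar{\delta}^{(\mathrm{pen}W)}_{n,p}$ have been 
computed numerically as a function of np for several resampling schemes, with 
n = 200. The results are given on Figures 1-6 (with straight lines for $\delta_{n,p}$ and 
dots for $\bar{\delta}^{(\mathrm{pen}W)}_{n,p}$). 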

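It follows that Loo weights are the most accurate ones, even when np is 
small. On the contrary, Rho(n/2) and Rad tend to overestimate $\operatorname{pen}_{\mathrm{id}}$ since 
$\bar{\delta}^{(\mathrm{pen}W)}_{n,p} > \delta_{n,p}$ (except when np is small, where the inequality is reversed). It 
also seems that the bias of Rho(q) is a decreasing function of q, as illustrated by 
Figures 3-4. Finally, Efr and Poi are strongly underestimating the ideal penalty, 
mostly because of the $1 - (n \hat{p}_\lambda)^{-1}$ term in $R_{1,W}(n, \hat{p}_\lambda)$ and $R_{2,W}(n, \hat{p}_\lambda)$. 
This can be summed up as follows: 

$$\operatorname{pen}_{\mathrm{Rad}} \approx \operatorname{pen}_{\mathrm{Rho}} > \operatorname{pen}_{\mathrm{Loo}} \approx \operatorname{pen}_{\mathrm{id}} \gg \operatorname{pen}_{\mathrm{Efr}} \approx \operatorname{pen}_{\mathrm{Poi}}, \quad (23)$$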
Fig 3. $\bar{\delta}^{(\mathrm{penRho}(n/2))}_{n,p} > \delta_{n,p}$ for np > 6. 

Fig 4. $\bar{\delta}^{(\mathrm{penRho}(n/4))}_{n,p} > \delta_{n,p}$ for np > 9. 

Fig 5. $\bar{\delta}_{n,p} \approx \delta_{n,p}$. 

Fig 6. $\bar{\delta}^{(\mathrm{penRad}(1/2))}_{n,p} > \delta_{n,p}$ for np > 6. 


where ">>" means a comparatively large gap, but still negligible at first 
order. Hence, we can expect that the Loo penalty is the most efficient, closely 
followed by Rad and by Rho. However, from the non-asymptotic point of view, 
it turns out that smaller prediction loss is obtained by overpenalizing slightly 
(and sometimes strongly, see the simulations of Section 5 and the discussion of 
Section 6.3.2). Then, the ordering of (23) may also be the one of the prediction 
performances of RP, the best performances being obtained with Rad and Rho. 
This is confirmed by the simulation study of Section 5. 

Another interesting point is that o„ ^ oc d„ p when np is large enough. 
Then, provided that histograms with too small bins are removed from the 
collection, $\operatorname{pen}_{\mathrm{Loo}}$ and $\operatorname{pen}_{\mathrm{Rho}}$ are almost equivalent, up to the choice of the factor 
C. If a wise tuning of C is possible, it remains to choose between Loo and Rho 
according to computational issues (see the discussion of Section 6.2). 

4.2. Other exchangeable weights 

The oracle inequality of Theorem 1 is only stated for the five "classical" 
exchangeable weights of Section 2.2. Nevertheless, replacing the threshold 3 by 
some T > 2 at step 1 of Procedure 1, the proof of Theorem 1 can be extended 
to any resampling weight vector W satisfying: 

1. W is exchangeable, 

2. $R_{1,W}(n, p) + R_{2,W}(n, p) \approx 2 C_W^{-1}$ for np large enough (with a non-asymptotic 
control on the ratio between these two quantities, as in the proof of 
Proposition 2), 

3. $R_{1,W}(n, p) + R_{2,W}(n, p) \ge (1 + \epsilon) C_W^{-1}$ for some $\epsilon > 0$, as soon as $np \ge T > 2$ 
(as in Lemma 15). 

In particular, the first two conditions hold for all the exchangeable weights 
considered in Proposition 2. The third one is satisfied for most of them as soon 
as T is large enough (see Lemma 15 in Section 8.6). 

5. Simulation study 

As an illustration of the results of Section 3, the prediction performances of 
Procedure 1 (with several resampling schemes), Mallows' $C_p$ and V-fold 
cross-validation are compared on some simulated data. 

5.1. Experimental setup 

We consider four experiments, called S1, S2, HSd1 and HSd2. Data are generated 
according to 

$$Y_i = s(X_i) + \sigma(X_i) \varepsilon_i$$

where $(X_i)_{1 \le i \le n}$ are independent with uniform distribution over $\mathcal{X} = [0; 1]$ 
and $(\varepsilon_i)_{1 \le i \le n}$ are independent standard Gaussian variables independent of 
$(X_i)_{1 \le i \le n}$. The experiments differ in the regression function s (smooth for 
S, see Figure 7; smooth with jumps for HS, see Figure 8), the noise type 
(homoscedastic for S1 and HSd1, heteroscedastic for S2 and HSd2) and the sample 
size n (see Table 3). Instances of data sets are plotted on Figures 9-12. 
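The data-generating mechanism above is simple to reproduce. The following sketch draws one data set with uniform design on [0, 1] and Gaussian noise; the regression function `sin(pi * x)` and the noise level `sigma(x) = x` used in the example call are illustrative assumptions, and the function name `simulate` is hypothetical.

```python
import math
import random


def simulate(n, s, sigma, rng):
    """Draw (X_i, Y_i), i = 1..n, with Y_i = s(X_i) + sigma(X_i) * eps_i.

    X_i is uniform on [0, 1) and eps_i is standard Gaussian, independent of X_i.
    """
    xs = [rng.random() for _ in range(n)]
    ys = [s(x) + sigma(x) * rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so that the draw is reproducible
xs, ys = simulate(200, lambda x: math.sin(math.pi * x), lambda x: x, rng)
```

Passing a constant function for `sigma` gives a homoscedastic experiment, while a non-constant one gives a heteroscedastic experiment.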



Fig 11. HSd1: HeaviSine, $\sigma \equiv 1$, n = 2048. 

Fig 12. HSd2: HeaviSine, $\sigma(x) = x$, n = 2048. 
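The collections of histogram models also differ according to the experiments. 
Define 

$$\forall k, k_1, k_2 \in \mathbb{N} \setminus \{0\}, \quad (I_\lambda)_{\lambda \in \Lambda_k} := \left( \left[ \frac{j}{k} ; \frac{j+1}{k} \right) \right)_{0 \le j \le k-1}$$

and 

$$(I_\lambda)_{\lambda \in \Lambda_{(k_1, k_2)}} := \left( \left[ \frac{j}{2 k_1} ; \frac{j+1}{2 k_1} \right) \right)_{0 \le j \le k_1 - 1} \cup \left( \left[ \frac{1}{2} + \frac{j}{2 k_2} ; \frac{1}{2} + \frac{j+1}{2 k_2} \right) \right)_{0 \le j \le k_2 - 1}.$$

For every $m \in (\mathbb{N} \setminus \{0\}) \cup (\mathbb{N} \setminus \{0\})^2$, let $S_m$ be the histogram model associated 
with the partition $(I_\lambda)_{\lambda \in \Lambda_m}$. Then, for each experiment, the collection of models 
is $(S_m)_{m \in \mathcal{M}_n}$ with different index sets $\mathcal{M}_n$: 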

S1 regular histograms with $1 \le D \le n (\ln(n))^{-1}$ pieces, that is 
$\mathcal{M}_n = \left\{ 1, \ldots, \left\lfloor \frac{n}{\ln(n)} \right\rfloor \right\}$. 

S2 histograms regular on [0; 1/2] (resp. on [1/2; 1]), with $D_1$ (resp. $D_2$) 
pieces, $1 \le D_1, D_2 \le n (2 \ln(n))^{-1}$. The model of constant functions is 
added to $\mathcal{M}_n$, that is 
$\mathcal{M}_n = \{1\} \cup \left\{ 1, \ldots, \left\lfloor \frac{n}{2 \ln(n)} \right\rfloor \right\}^2$. 

HSd1 dyadic regular histograms with $2^k$ pieces, $0 \le k \le \ln_2(n) - 1$, that is 
$\mathcal{M}_n = \{ 2^k \text{ s.t. } 0 \le k \le \ln_2(n) - 1 \}$. 

HSd2 dyadic histograms regular on [0; 1/2] (resp. on [1/2; 1]) with bin sizes 
$2^{-k_1 - 1}$ (resp. $2^{-k_2 - 1}$), $0 \le k_1, k_2 \le \ln_2(n) - 2$ (dyadic version of S2). The 
model of constant functions is added to $\mathcal{M}_n$, that is 
$\mathcal{M}_n = \{1\} \cup \{ 2^k \text{ s.t. } 0 \le k \le \ln_2(n) - 2 \}^2$. 

Note that the collections of models used in experiments S2 and HSd2 can adapt 
to s and $\sigma(\cdot)$. Therefore, the oracle model is generally quite efficient so that the 
model selection problem is more challenging. 
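To make the sizes of these collections concrete, the index sets can be enumerated directly. The sketch below assumes the index sets read as $\{1, \ldots, \lfloor n/\ln(n) \rfloor\}$ for S1, $\{1\} \cup \{1, \ldots, \lfloor n/(2\ln(n)) \rfloor\}^2$ for S2 and $\{2^k : 0 \le k \le \ln_2(n) - 1\}$ for HSd1; the function names and the encoding of a model as an integer or a pair are hypothetical.

```python
import math


def index_set_s1(n):
    """Regular histograms with D pieces, 1 <= D <= n / ln(n)."""
    return list(range(1, math.floor(n / math.log(n)) + 1))


def index_set_s2(n):
    """Constant model (encoded as 1) plus all pairs (D1, D2) with
    1 <= D1, D2 <= n / (2 ln(n)): D1 pieces on [0, 1/2], D2 pieces on [1/2, 1]."""
    d_max = math.floor(n / (2 * math.log(n)))
    pairs = [(d1, d2) for d1 in range(1, d_max + 1) for d2 in range(1, d_max + 1)]
    return [1] + pairs


def index_set_hsd1(n):
    """Dyadic regular histograms with 2^k pieces, 0 <= k <= log2(n) - 1."""
    k_max = math.floor(math.log2(n)) - 1
    return [2 ** k for k in range(k_max + 1)]


sizes = (len(index_set_s1(200)), len(index_set_s2(200)), len(index_set_hsd1(2048)))
```

Under these assumptions, the collection for S2 is roughly nine times larger than the one for S1 at n = 200, whereas the dyadic collection for HSd1 only grows logarithmically with n.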
The following procedures^ are compared: 

Mal Mallows' $C_p$ penalty: $\operatorname{pen}(m) = 2 \hat{\sigma}^2 D_m n^{-1}$ where $\hat{\sigma}^2$ is the classical 
variance estimator defined as 

$$\hat{\sigma}^2 := \frac{d^2\left( Y_{1 \ldots n}, S_{\lfloor n/2 \rfloor} \right)}{n - \lfloor n/2 \rfloor} \quad (24)$$

where $Y_{1 \ldots n} = (Y_i)_{1 \le i \le n} \in \mathbb{R}^n$, $S_{\lfloor n/2 \rfloor}$ is any model of dimension $\lfloor n/2 \rfloor$ 
(only assumed to have a bias negligible in front of $\sigma^2$) and d is the 
Euclidean distance on $\mathbb{R}^n$. The non-asymptotic validity of this model 
selection procedure in homoscedastic regression has been assessed by 
Baraud [13]. 

E[pen$_{\mathrm{id}}$] Expectation of the ideal penalty: $\operatorname{pen}(m) = E[\operatorname{pen}_{\mathrm{id}}(m)]$, which 
witnesses what is a good performance in each experiment. 

VFCV V-fold cross-validation, with $V \in \{2, 5, 10, 20\}$ (defined as in [9]). 

LOO Leave-one-out (that is VFCV with V = n). 

penEfr Efron(n) penalty (7) with $C = C_W = 1$. 

penRad Rademacher(1/2) penalty (7) with $C = C_W = 1$. 

penRho Random hold-out(n/2) penalty (7) with $C = C_W = 1$. 

penLoo Leave-one-out penalty (7) with $C = C_W = n - 1$. 

^The code used for computing resampling penalties is available on the author's webpage 
at http://www.di.ens.fr/~arlot/index.htm. 

For each of these, the same penalties multiplied by 5/4 are also considered (and 
they are denoted by a + symbol added after the shortened names). This intends 
to test for overpenalization (the choice of the factor 5/4 being arbitrary and 
certainly not optimal, see Section 6.3.2). 

In each experiment, for each simulated data set, first the models with 2 data 
points or less in one piece of their associated partition are removed. Then, the 
least-squares estimators $\hat{s}_m$ are computed for each $m \in \mathcal{M}_n$. Finally, $\hat{m} \in \mathcal{M}_n$ 
is selected using each procedure and its true excess loss $\ell(s, \hat{s}_{\hat{m}})$ is computed as 
well as the excess loss of the oracle $\inf_{m \in \mathcal{M}_n} \ell(s, \hat{s}_m)$. N = 1000 data sets are 
simulated, thanks to which the model selection performance of each procedure 
is estimated through the two following benchmarks: 

$$C_{\mathrm{or}} := \frac{E[\ell(s, \hat{s}_{\hat{m}})]}{E[\inf_{m \in \mathcal{M}_n} \ell(s, \hat{s}_m)]} \qquad C_{\mathrm{path\text{-}or}} := E\left[ \frac{\ell(s, \hat{s}_{\hat{m}})}{\inf_{m \in \mathcal{M}_n} \ell(s, \hat{s}_m)} \right].$$

Basically, $C_{\mathrm{or}}$ is the constant that should appear in an oracle inequality like (9), 
and $C_{\mathrm{path\text{-}or}}$ corresponds to a pathwise oracle inequality like (8). Since $C_{\mathrm{or}}$ and 
$C_{\mathrm{path\text{-}or}}$ approximately give the same rankings between procedures, Table 3 
only reports $C_{\mathrm{or}}$; the values of $C_{\mathrm{path\text{-}or}}$ are reported in [8]. 
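Both benchmarks are straightforward to estimate from simulation output. The sketch below assumes they are, respectively, the ratio of the mean excess loss of the selected estimator to the mean excess loss of the oracle, and the mean of the per-data-set ratios; expectations are replaced by empirical means over the simulated data sets, and the function names are hypothetical.

```python
from statistics import fmean


def c_or(selected_losses, oracle_losses):
    """Ratio of mean losses: estimates E[loss of selected] / E[loss of oracle]."""
    return fmean(selected_losses) / fmean(oracle_losses)


def c_path_or(selected_losses, oracle_losses):
    """Mean of per-data-set ratios: estimates E[loss of selected / loss of oracle]."""
    return fmean(s / o for s, o in zip(selected_losses, oracle_losses))


# Toy example with two simulated data sets
selected = [2.0, 4.0]
oracle = [1.0, 4.0]
benchmarks = (c_or(selected, oracle), c_path_or(selected, oracle))
```

In the toy example the two benchmarks differ (1.2 versus 1.5) because the pathwise ratio weights every data set equally, whereas the ratio of means is dominated by the data sets with large losses.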



5.2. Results and comments 

First, the above experiments show the interest of both Resampling Penalization 
(RP) and VFCV in several difficult frameworks, with relatively small sample 
sizes. Although RP and VFCV cannot compete with simple procedures such 
as Mallows' $C_p$ from the computational point of view, they are much more 
efficient when the noise is heteroscedastic (S2 and HSd2). In these difficult 
frameworks, the prediction performances of RP and VFCV are comparable to 
those of $E[\operatorname{pen}_{\mathrm{id}}]$. Note that in HSd2, penRad and penRho give smaller losses 
than any penalty proportional to the dimension of the models (see Section 7.1.2). 
Moreover, penRad and penRho perform slightly worse than Mallows' $C_p$ for the 
easiest problems (S1 and HSd1), which can be interpreted as the unavoidable 
price for robustness. 
Table 3 

Accuracy indices $C_{\mathrm{or}}$ for each procedure in four experiments, ± a rough estimate of 
uncertainty of the value reported (that is the empirical standard deviation divided by $\sqrt{N}$; 
N = 1000). In each column, the more accurate procedures (taking the uncertainty into 
account) are bolded 

Experiment | S1 | S2 | HSd1 | HSd2 
s | $\sin(\pi \cdot)$ | $\sin(\pi \cdot)$ | HeaviSine | HeaviSine 
$\sigma(x)$ | 1 | | 1 | x 
n (sample size) | 200 | 200 | 2048 | 2048 
$\mathcal{M}_n$ | regular | 2 bin sizes | dyadic, regular | dyadic, 2 bin sizes 
E[pen$_{\mathrm{id}}$] | 1 Qi Q + n 03 | 2 oqfi + 05 | 1 nos + n nn4 | 1.102 ± 0.004 
E[pen$_{\mathrm{id}}$]+ | | 2.028 ± 0.04 | 1.003 ± 0.003 | 1.089 ± 0.004 
Mal | -L.C/^O _1_ \J,\J'-I | O.UOl _1_ \j .\J I | 1 ni ^ + n 003 | J- .O 1 ij _1_ W. U -LU 
Mal+ | 1.800 ± 0.03 | 3.173 ± 0.07 | 1.002 ± 0.003 | 1.411 ± 0.008 
2-FCV | 2.078 ± 0.04 | 2.542 ± 0.05 | 1.002 ± 0.003 | 1.184 ± 0.004 
5-FCV | 2.137 ± 0.04 | 2.582 ± 0.06 | 1.014 ± 0.003 | 1.115 ± 0.005 
10-FCV | 2.097 ± 0.04 | 2.603 ± 0.06 | 1.021 ± 0.003 | 1.109 ± 0.004 
20-FCV | 2.088 ± 0.04 | 2.578 ± 0.06 | 1.029 ± 0.004 | 1.105 ± 0.004 
LOO | 2.077 ± 0.04 | 2.593 ± 0.06 | 1.034 ± 0.004 | 1.105 ± 0.004 
penRad | 1.973 ± 0.04 | 2.485 ± 0.06 | 1.018 ± 0.003 | 1.102 ± 0.004 
penRho | 1.982 ± 0.04 | 2.502 ± 0.06 | 1.018 ± 0.003 | 1.103 ± 0.004 
penLoo | 2.080 ± 0.04 | 2.593 ± 0.06 | 1.034 ± 0.004 | 1.105 ± 0.004 
penEfr | 2.597 ± 0.07 | 3.152 ± 0.07 | 1.067 ± 0.005 | 1.114 ± 0.005 
penRad+ | 1.799 ± 0.03 | 2.137 ± 0.05 | 1.002 ± 0.003 | 1.095 ± 0.004 
penRho+ | 1.798 ± 0.03 | 2.142 ± 0.05 | 1.002 ± 0.003 | 1.095 ± 0.004 
penLoo+ | 1.844 ± 0.03 | 2.215 ± 0.05 | 1.004 ± 0.003 | 1.096 ± 0.004 
penEfr+ | 2.016 ± 0.05 | 2.605 ± 0.06 | 1.011 ± 0.003 | 1.097 ± 0.004 



Second, in the four experiments, the best procedures always are the
overpenalizing ones: many of them even beat the perfectly unbiased E[pen_id],
showing the crucial need to overpenalize. This phenomenon disappears for small σ
and large n [8, Experiments S0.1 and S1000], hence it is certainly due to the small
signal-to-noise ratio. We would like to insist on the importance of the
overpenalization phenomenon, which is seldom mentioned in theoretical papers
because it vanishes in the asymptotic framework and is quite hard to derive from
theoretical results.

Let us now compare RP and VFCV. According to the four experiments of
Table 3, RP with Rad or Rho resampling schemes clearly outperforms VFCV for
any V, even without overpenalizing. The only exception is HSd1, where
2-fold cross-validation yields a particularly good model selection performance.

This can be interpreted thanks to the non-asymptotic study of the performance
of V-fold cross-validation provided in [9]. In short, VFCV overpenalizes
within a factor 1 + 1/(2(V − 1)), while the V-fold criterion has a variance
decreasing with V.






Then, when overpenalization is necessary (for instance in S1, S2 or HSd1),
small values of V can outperform the leave-one-out (V = n). Nevertheless,
RP with the right overpenalization level C/Cw leads to a smaller prediction
loss than VFCV, because RP provides a less variable model selection criterion
than VFCV. The reason why penRad and penRho also perform slightly better
without overpenalization is that they naturally overpenalize when C = Cw = 1
(see Section 4).
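
As a quick numerical illustration of the overpenalization factor 1 + 1/(2(V − 1)) of VFCV, the display below evaluates it for the values of V used in Table 3. The arithmetic is ours; n = 200 is the sample size of experiments S1 and S2.

```latex
\[
1 + \frac{1}{2(V-1)} =
\begin{cases}
1.5 & \text{if } V = 2, \\
1.125 & \text{if } V = 5, \\
1 + 1/18 \approx 1.056 & \text{if } V = 10, \\
1 + 1/38 \approx 1.026 & \text{if } V = 20, \\
1 + 1/398 \approx 1.003 & \text{if } V = n = 200 \text{ (leave-one-out)}.
\end{cases}
\]
```

Hence 2-fold cross-validation inflates the penalty by 50%, whereas the leave-one-out is almost unbiased, which is consistent with small values of V doing well when overpenalization is needed.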

Let us now consider the model selection performance of RP with several
exchangeable resampling schemes. The two best ones are Rad and Rho in the
four experiments, with or without overpenalization. Then, Loo performs slightly
worse (but not always significantly) and Efr much worse. Looking carefully at
the values of the penalties, it appears that Rad and Rho slightly overpenalize,
Loo is exactly at the right level, and Efr underpenalizes (as well as Poi,
which has performances quite similar to the ones of Efr, see [8]). Note that
this comparison can also be derived from theoretical computations (see
Section 4). Since overpenalization is beneficial in the four experiments of
Table 3, this explains why penRad and penRho slightly outperform penLoo. In the
case of Efron's bootstrap penalty, underpenalizing implies overfitting, which
explains the comparatively bad performances reported in Table 3.

We conclude this section with remarks concerning some particular points of 
the simulation study. 

• On the same data sets, Mallows' Cp and its overpenalized version Mal+
were performed with the true mean variance E[σ²(X)] instead of its estimate
(which would not be possible on a real data set). It yielded worse model
selection performance for all experiments but S2, in which Cor(Mal) =
2.657 ± 0.06 and Cor(Mal+) = 2.437 ± 0.05. Therefore, overpenalization
is crucial in experiment S2, more than the shape* of the penalty itself.
Moreover, the overpenalization level being fixed, resampling penalties
remain significantly better than Mallows' Cp. Hence, the performances of
Mallows' Cp in Table 3 are not only due to a bad estimation of the mean
noise-level (see also Section 7.1).

• Eight additional experiments are reported in [8], showing similar results
with various n, σ and s (although the assumptions of Theorem 1 are not
always satisfied).

• Resampling penalties with a V-fold subsampling scheme have also been 
studied in [9, Section 4] on the same simulated data: exchangeable resam- 
pling schemes always give better model selection performance than non- 
exchangeable ones (significantly when V is small), except for Efr and Poi 
which tend to underestimate the ideal penalty. 



*The shape of a penalty is defined as the way pen(m) depends on m up to a linear
transformation.
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6. Practical implementation 

This section tackles three main issues for using Procedure 1 in practice: how
to compute the resampling penalty (7)? How to choose the weights W? How to
choose the constant C?

6.1. Computational cost 

An exact computation of resampling penalties with exchangeable weights (without
using formula (50) for histograms) would be either impossible or
computationally expensive. We suggest two possible ways to fix this problem.

First, one can use a classical Monte-Carlo approximation, that is, draw a small
number B of independent weight vectors instead of considering each element
of the support of D(W). Practical Monte-Carlo methods for the bootstrap are
proposed for instance by Hall [39, Appendix II]. Moreover, a non-asymptotic
estimation of the accuracy of Monte-Carlo approximation can be obtained via
McDiarmid's inequality (see Arlot, Blanchard and Roquain [10, Proposition 2.7]
for a precise result using the same idea in another framework). This would
provide a practical way of quantifying what is lost by making a Monte-Carlo
approximation, and of choosing B accordingly (at least for Rad, Rho and Loo
weights).
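
To make the Monte-Carlo approximation concrete, here is a minimal sketch for a single regressogram model with the least-squares contrast. It is an illustration under stated assumptions, not the exact penalty (7): all function names are ours, the Rad weights are assumed to take the values 0 and 2 with probability 1/2 each, and a bin that receives zero resampled weight simply keeps its plain mean. The Monte-Carlo standard error returned with the estimate is one empirical way of judging whether B is large enough.

```python
import numpy as np

def rad_weights(n, rng):
    # Assumed "Rad" scheme: W_i is 0 or 2 with probability 1/2 each, so E[W_i] = 1.
    return 2.0 * rng.integers(0, 2, size=n)

def resampling_term(bins, y, w, n_bins):
    # One draw of P_n gamma(s^W) - P_n^W gamma(s^W), least-squares contrast,
    # where s^W is the weighted regressogram (one weighted mean per bin).
    sw = np.bincount(bins, weights=w, minlength=n_bins)
    swy = np.bincount(bins, weights=w * y, minlength=n_bins)
    counts = np.bincount(bins, minlength=n_bins)
    plain = np.bincount(bins, weights=y, minlength=n_bins) / np.maximum(counts, 1)
    # Simplification (ours): a bin whose resampled weight is zero keeps its plain mean.
    fit = np.where(sw > 0, swy / np.where(sw > 0, sw, 1.0), plain)
    sq = (y - fit[bins]) ** 2
    return sq.mean() - (w * sq).mean()

def monte_carlo_penalty(bins, y, n_bins, B=200, C=1.0, seed=0):
    # Average B independent draws; also report the Monte-Carlo standard error.
    rng = np.random.default_rng(seed)
    draws = np.array([resampling_term(bins, y, rad_weights(len(y), rng), n_bins)
                      for _ in range(B)])
    return C * draws.mean(), C * draws.std(ddof=1) / np.sqrt(B)

# Toy usage: n = 100 points, 5 regular bins on [0, 1].
rng = np.random.default_rng(1)
x = rng.uniform(size=100)
y = np.sin(np.pi * x) + rng.normal(size=100)
bins = np.minimum((5 * x).astype(int), 4)
pen, se = monte_carlo_penalty(bins, y, n_bins=5)
print(f"penalty ~ {pen:.4f} (Monte-Carlo standard error {se:.4f})")
```

If the reported standard error is not small compared to the differences between penalized criteria of competing models, B should be increased.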

Second, it is possible to use non-exchangeable weight vectors W such that the
cardinality of the support of D(W) is much smaller than n. A typical example is
V-fold subsampling: given a partition (B_j)_{1≤j≤V} of {1, ..., n} and J a uniform
random variable over {1, ..., V} independent of the data, we define

\[ \forall i \in \{1, \ldots, n\}, \quad W_i = \frac{V}{V-1} \mathbf{1}_{i \notin B_J} . \]

The resulting resampling penalties, called V-fold penalties, have been
introduced and studied in [9]. They are computationally similar to VFCV while being
more flexible, since the overpenalization factor is decoupled from the choice of
V; hence, like resampling penalties, V-fold penalties select an estimator with
smaller prediction loss than the one selected by VFCV.
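
The V-fold subsampling weights can be sketched as follows. This is a toy illustration with hypothetical function names: the weights are taken to be W_i = V/(V − 1) outside the block B_J and 0 inside it, and the "model" is a single constant fitted by weighted least squares, chosen only to keep the example short. Since J takes V values, the expectation over W is an exact average of V terms and no Monte-Carlo step is needed.

```python
import numpy as np

def vfold_weights(blocks, j, n):
    # W_i = V/(V-1) if i is outside block B_j, and 0 if i belongs to B_j.
    V = len(blocks)
    w = np.full(n, V / (V - 1.0))
    w[blocks[j]] = 0.0
    return w

def constant_model_term(y):
    # Toy "model": a single constant fitted by weighted least squares.
    def term(w):
        fit = np.sum(w * y) / np.sum(w)
        sq = (y - fit) ** 2
        return sq.mean() - (w * sq).mean()   # P_n gamma - P_n^W gamma
    return term

def vfold_penalty(term, blocks, n, C=1.0):
    # J is uniform on the V blocks, so E_W is an exact average over V weight vectors.
    return C * float(np.mean([term(vfold_weights(blocks, j, n))
                              for j in range(len(blocks))]))

rng = np.random.default_rng(0)
n, V = 12, 3
y = rng.normal(size=n)
blocks = np.array_split(rng.permutation(n), V)
pen = vfold_penalty(constant_model_term(y), blocks, n)
print(pen)
```

The constant C is passed separately from V, which is the decoupling between overpenalization and the choice of V mentioned above.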

Both Monte-Carlo approximation of RP and V-fold penalization have been
tested on the simulated data of Section 5. The detailed results are given in [8].

6.2. Choice of the weights 

The influence of the weights has been investigated from the theoretical point
of view in Section 4 with focus on second-order terms in expectation. However,
deviations of pen(m) around its expectation are likely to depend on the weight
vector W since the upper bound in (17) may not be tight. The simulation study
of Section 5 makes it possible to take both phenomena into account in the
comparison between the resampling weights.
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In terms of model selection efficiency, Table 3 shows that the best weights 
(for accuracy of prediction and for the variability^ of this accuracy) are Rho and
Rad, whereas Loo performs slightly worse. By contrast, from both accuracy
and variability points of view, Efron's bootstrap weights perform worse than
Rho, Rad and Loo, mainly because they lead to underpenalization.

Note however that this comparison strongly depends on the precise
definition^^ of Cw, which makes all penalties unbiased at first order but possibly
under- or over-penalizing at second order. Then, different prediction performances
may be observed on data which do not require overpenalization. Nevertheless,
the computations of Section 4 show that Efron's bootstrap weights have a real
drawback which cannot be fixed only by changing Cw.

When computing the penalties exactly, Loo weights are the only computationally
tractable ones, while being almost as accurate as Rho and Rad. Hence,
we suggest their use, enlarging the constant C when needed (see Section 6.3.2
on overpenalization).

However, computing n empirical risk minimizers (or the outputs of
computationally more expensive algorithms) for each model is not always possible.
In such a case, one should avoid using the leave-one-out with a Monte-Carlo
approximation, which would give a large importance to a small number of data
points. Rho or Rad weights are much safer in this situation. Alternatively,
V-fold penalties [9] are a good option when the computational power is limited.

Let us emphasize that this analysis and the subsequent advice should be
considered with caution. First, the deviations of resampling penalties around their
expectations should be understood much better, because they can be comparable
to or even larger than the second-order terms in expectations. Second, the
optimal choice of V for V-fold cross-validation is known to be different between
least-squares regression and binary classification [9, Section 2.3]. Such differences
are expected to arise for choosing between exchangeable resampling weights.

Note that the bias of the bootstrap penalty has already been noticed
by Efron [30, 31], who proposed several ways to correct it, including a double
bootstrap procedure and the .632 bootstrap. The novelty of the approach of this
paper is to propose the use of other exchangeable resampling schemes instead
of the bootstrap, so that the bias of resampling penalties no longer has to be
corrected.

^The variability of the accuracy is more an indicator of the stability of the performance of 
RP than of the variance of the resampling penalty. However, it remains an interesting measure, 
since a procedure performing always equally well can be preferred to a procedure with better 
mean efficiency but poor performances on a small probability event. 

^^However, it is quite unclear how to change Cw in order to optimize each penalty in the
general case. This is why Cw has been chosen as "simple" as possible in Table 2.






6.3. Choice of the constant C 

6.3.1. Optimal constant for bias 

From the asymptotic point of view, the optimal C = C* for prediction is
generally the one for which pen estimates the ideal penalty pen_id unbiasedly (at least
for collections of models of polynomial size). This is how Cw is defined in the
histogram framework, and Theorem 1 implies that C = Cw is asymptotically
optimal for prediction. Hence^^, C* is asymptotically equivalent to Cw.

As shown by Arlot and Massart [11], C* can also be estimated directly from
data for general penalties, in particular for RP. Hence, the knowledge of Cw
is not necessary, which can be useful in the general prediction framework (see
Section 7.2).

6.3.2. Overpenalization 

A careful look at the proof of Theorem 1 shows that a similar oracle inequality 
holds for any C > 4:Cw/5, the leading constant remaining close to one when C ~ 
Cw asymptotically. In other words, when the sample size n is small, the optimal 
constant C* may not be exactly equal to Cw. The simulations of Section 5 also
support this fact: overpenalization, that is, taking C = Cov Cw with Cov > 1,
can improve the prediction performance of the selected estimator
$\hat{s}_{\hat{m}}$ when n is small, when σ is large or when s is non-smooth.

This problem would appear even if the "optimal" constant C* such that
pen is non-asymptotically unbiased were known. In Figure 13, the estimated
model selection performance of the penalty Cov E[pen_id(m)] is plotted as a
function of Cov, for experiment S2 of Section 5. It appears that the optimal
overpenalization constant C*_ov lies in (1.5; 2.35) for this particular problem. More
generally, the drawback of using C = C* is that it does not take into account
the deviations of pen_id(m) around its expectation. To avoid the possible overfit
induced by these deviations, the constant C must be slightly enlarged. A major
issue remains: how to estimate C*_ov from data only, since it strongly depends
on n, on σ, on the smoothness of s and on the number of models in Mn?

One can think of choosing Cov by V-fold cross-validation, but this would lead
to a computationally intractable procedure. An alternative idea is to use
resampling for building a simultaneous confidence region on (pen_id(m))_{m ∈ Mn}
instead of estimating E[pen_id(m)] only (see [10] on confidence regions built with
general exchangeable resampling schemes). Then, the uncertainty on the estimation of
pen_id(m) can be taken into account for choosing a model, similarly to model
selection procedures built upon relative bounds [12, 24]. Finally, the choice of
the overpenalization factor would be replaced by the choice of a confidence level,
which should be made by the practitioner. See also [6, Section 11.3.3] for a
discussion on a data-driven choice of the overpenalization factor.

^^See the proof of Theorem 1 in [d] to prove that asymptotic optimality requires
C*/Cw → 1 as soon as there are enough models close to the oracle.
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Fig 13. The non-asymptotic need for overpenalization: the prediction performance Cor
(defined in Section 5.1) of the model selection procedure (2) with pen(m) = Cov E[pen_id(m)] is
represented as a function of Cov. Data and models are the ones of experiment S2: n = 200,
σ(x) = x, s(x) = sin(πx). See Section 5 for details.



7. Discussion 



7.1. Comparison with other procedures 

In this article, the Resampling Penalization (RP) family of model selection
procedures is defined and shown to satisfy some optimality properties under mild
assumptions on the data (Theorems 1 and 2). In particular, RP is robust to the
heteroscedasticity of the noise according to both theoretical and experimental
results. The price for robustness is that the computational cost of RP is
generally larger than that of simple procedures like Mallows' Cp, even with the
suggestions of Section 6.1. The purpose of this subsection is to identify the "easy"
problems, for which the computational cost of RP can be reduced by using Cp-like
penalties without enlarging the prediction loss too much.



7.1.1. Mallows' Cp 

Mallows' Cp penalty is equal to $2\sigma^2 D_m n^{-1}$ for a model $S_m$ of dimension $D_m$,
when the noise-level $\sigma$ is constant. Non-asymptotic results about Cp-like
penalties can be found in [16, 13, 14, 21]. They imply that Mallows' Cp is
asymptotically optimal in the homoscedastic framework, when the size of
$\mathcal{M}_n$ is polynomial in n.

When the mean noise-level is unknown, it must be estimated. A classical
estimator of $\mathbb{E}[\sigma(X)^2]$ is defined by (24). Baraud [13, 14] showed that the
resulting data-driven model selection procedure satisfies a non-asymptotic oracle
inequality with leading constant close to one.

Assume for the sake of simplicity that n is even and let $S_{n/2}$ be a model such
that each piece of the associated partition contains exactly two data points.
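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Reordering the $(X_i, Y_i)$ according to $X_i$,

\[ \mathrm{pen}_{\mathrm{Mallows}}(m) = \frac{2 D_m}{n^2} \sum_{i=1}^{n/2} \left( Y_{2i} - Y_{2i-1} \right)^2 \]

so that

\[ \mathbb{E}\left[ \mathrm{pen}_{\mathrm{Mallows}}(m) \right] = \frac{2}{n} \sum_{\lambda \in \Lambda_m} D_m p_\lambda (\sigma_\lambda^r)^2 + \frac{2 D_m}{n^2} \sum_{i=1}^{n/2} \mathbb{E}\left[ \left( s(X_{2i}) - s(X_{2i-1}) \right)^2 \right] \tag{25} \]

where $(\sigma_\lambda^r)^2 := \mathbb{E}\left[ \sigma(X)^2 \mid X \in I_\lambda \right]$.
This should be compared with the result of Proposition 1:

\[ \mathbb{E}\left[ \mathrm{pen}_{\mathrm{id}}(m) \right] \approx \frac{2}{n} \sum_{\lambda \in \Lambda_m} \left( (\sigma_\lambda^d)^2 + (\sigma_\lambda^r)^2 \right) \tag{26} \]

where

\[ (\sigma_\lambda^d)^2 := \mathbb{E}\left[ \left( s(X) - s_m(X) \right)^2 \mid X \in I_\lambda \right] . \]
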

Although both Mallows' Cp and the ideal penalty are in expectation the sum
of a "variance" term (involving the $(\sigma_\lambda^r)^2$) and a "bias" term (involving the
variations of s through $(s(X_{2i}) - s(X_{2i-1}))^2$ or $(\sigma_\lambda^d)^2$), they differ on at least
two points.

First, when s is smooth and $\min_{\lambda \in \Lambda_m} \{ n p_\lambda \}$ is large, the "bias" term in (25)
is negligible compared to the one of (26), which means that Mallows' Cp
underpenalizes when the "bias" component of $\mathrm{pen}_{\mathrm{id}}$ is large. Second, the "variance"
component of $\mathrm{pen}_{\mathrm{id}}$, which is the main one in general, is distorted in Mallows'
Cp: the part of the penalty corresponding to $I_\lambda$ is multiplied by $D_m p_\lambda$, which is
not close to 1 when the partition $(I_\lambda)_{\lambda \in \Lambda_m}$ is not regular with respect to
the distribution of X.

This happens for instance in experiments S2 and HSd2 of Section 5. Therefore,
there are at least three possibly "hard" problem classes:

• heteroscedastic noise, with irregular histograms and X uniform (for
instance S2, HSd2 in Section 5, or Svar2 in [8]),

• heteroscedastic noise, with regular histograms and X highly non-uniform
on $\mathcal{X}$,

• regression function s with jumps (such as HeaviSine^^) or large non-smooth
areas (such as Doppler in [8]).

In any of these cases, one should avoid the use of Cp-like penalties, and we
suggest resampling penalties as an efficient alternative. As explained in
Section 7.1.2 below, the first class of problems can make any penalty proportional
to the dimension Dm suboptimal.
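
The following sketch illustrates the two ingredients of this comparison. It is only an illustration with hypothetical function names: the variance estimator is assumed to be the paired-difference estimator implied by the model $S_{n/2}$ above (squared differences of consecutive responses after sorting by X, divided by n), and the distortion factors $D_m p_\lambda$ are computed for X uniform on [0, 1].

```python
import numpy as np

def pair_variance_estimate(x, y):
    # Sort by x, pair consecutive points, and sum the squared within-pair differences.
    y_sorted = y[np.argsort(x)]
    diffs = y_sorted[1::2] - y_sorted[0::2]   # assumes an even sample size
    return np.sum(diffs ** 2) / len(y)

def mallows_penalty(x, y, dim):
    # Plug-in Mallows' Cp penalty 2 * sigma_hat^2 * D_m / n.
    return 2.0 * pair_variance_estimate(x, y) * dim / len(y)

def distortion_factors(edges):
    # D_m * p_lambda for X uniform on [0, 1] and bins delimited by `edges`.
    p = np.diff(edges)
    return len(p) * p

# Regular partition: every factor equals 1, so the variance term is not distorted.
print(distortion_factors(np.linspace(0.0, 1.0, 5)))
# Two bin sizes: large bins are overweighted (1.5), small bins underweighted (0.75).
print(distortion_factors(np.array([0.0, 0.5, 0.75, 1.0])))
```

With heteroscedastic noise, bins where the factor is below 1 and the local noise level is high are underpenalized by any penalty of this form.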



^^However, in experiment HSd1, Mallows' Cp still behaves quite well compared to RP. We
do not know whether the non-smoothness of s can actually make Mallows' Cp fail.
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7.1.2. Linear penalties 

Mallows' Cp is simple because it is a linear function of the dimension of $S_m$:

\[ \mathrm{pen}(m) = K D_m \tag{27} \]

and K is the only constant to determine. Depending on what is known on the
mean variance level, the constant $K_{\mathrm{Mallows}}$ can be defined as

\[ 2\, \mathbb{E}\left[ \sigma(X)^2 \right] n^{-1} \quad \text{or} \quad 2\, \hat{\sigma}^2 n^{-1} . \]

Refined versions of Mallows' Cp have also been proposed [16, 14, 21] but they
are still linear or very close to linearity.

However, according to (11), the ideal penalty is not linear in general, even
in expectation. Moreover, there exist some frameworks in which any penalty
of the form (27) is suboptimal when data are heteroscedastic [7], that is, it
cannot satisfy any oracle inequality with leading constant smaller than some
absolute constant κ > 1. In other words, the optimal linear penalization
procedure $\mathrm{pen}_{\mathrm{opt,lin}}(m) := K^\star D_m$ is suboptimal, where

\[ K^\star \in \arg\min_{K > 0} \left\{ P\gamma\left( \hat{s}_{\hat{m}(K)} \right) \right\} \quad \text{and} \quad \forall K > 0, \; \hat{m}(K) \in \arg\min_{m \in \mathcal{M}_n} \left\{ P_n\gamma\left( \hat{s}_m \right) + K D_m \right\} . \]

As shown by Theorem 1, RP does not suffer from this drawback.
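
The optimal linear penalization procedure is an oracle benchmark: it can only be computed in simulations, where the true prediction loss of each model is known. A minimal sketch, with hypothetical array names and a finite grid of constants K standing in for the minimization over K > 0, could look as follows.

```python
import numpy as np

def select_linear(emp_risk, dims, K):
    # m_hat(K): model minimizing empirical risk + K * dimension.
    return int(np.argmin(emp_risk + K * dims))

def optimal_linear_constant(emp_risk, dims, true_loss, K_grid):
    # K*: constant on the grid whose selected model has the smallest true loss.
    losses = np.array([true_loss[select_linear(emp_risk, dims, K)] for K in K_grid])
    best = int(np.argmin(losses))
    return K_grid[best], losses[best]

# Toy numbers (made up for illustration only).
emp_risk = np.array([1.0, 0.5, 0.3, 0.25])    # empirical risks of 4 nested models
dims = np.array([1, 2, 3, 4])                 # their dimensions D_m
true_loss = np.array([1.1, 0.7, 0.6, 0.9])    # true losses, known only in simulations
K_grid = np.array([0.0, 0.1, 0.3, 1.0])
K_star, loss_star = optimal_linear_constant(emp_risk, dims, true_loss, K_grid)
print(K_star, loss_star)
```

Because K* is tuned with the true loss, its performance is a lower bound on what any data-driven penalty of the form (27) can achieve on the grid.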

On the one hand, the optimal linear penalization procedure has a better
model selection performance than RP for S1, S2 and HSd1, which is not
surprising for the "easy" problems where Mallows' Cp is almost optimal (S1, HSd1). It
is less intuitive for S2, where data are heteroscedastic. Considering that
$\mathrm{pen}_{\mathrm{opt,lin}}$ uses the knowledge of the true distribution P, one can understand
that this is sufficient to keep a good performance for "intermediate" problems.

On the other hand, in experiment HSd2, the optimal linear penalization has
a model selection performance Cor = 1.18 ± 0.01, which is worse than the one
of RP (Cor < 1.11). Thus, the most difficult problem of Section 5 (with a large
collection of models, heteroscedastic data and bias) gives an example where
linear penalties are definitely not adapted, in addition to the ones of [7].

7.1.3. Ad hoc procedures 

One of the main advances with Theorems 1 and 2 is that RP is proved to work
in the heteroscedastic framework, contrary to Mallows' Cp. Nevertheless, in a
framework such as the one of experiment S2, Mallows' Cp can be adapted to
heteroscedasticity by splitting $\mathcal{X}$ into several parts where σ is almost constant,
and performing the histogram selection procedure with Mallows' Cp separately
on each part of $\mathcal{X}$.

More generally, Efromovich and Pinsker [28] and Galtchouk and Pergamen- 
schikov [35] (among several others) defined estimators of s that are minimax 
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adaptive in the heteroscedastic framework, the latter by model selection. In the 
Gaussian regression framework, Gendre [37] proposed a model selection method 
for estimating simultaneously the regression function and the noise level. 

All these procedures may perform slightly better than RP in terms of
prediction loss. They are called "ad hoc" because they have been specially designed
for the heteroscedastic framework (and a particular collection of estimators for
[35, 37]). On the contrary, RP is a general-purpose device: it was neither built
to be adaptive to heteroscedasticity nor to take advantage of a specific model,
and RP has exactly the same definition in the general prediction framework (see
Section 7.2).

When no information is available on the data or when no model selection
procedure is known for using such information, we suggest the use of RP. Moreover,
available information can be partial or wrong. Then, using an ad hoc procedure
would be disastrous whereas a general device like RP would still work. In short,
choose RP if you have no useful information or if you do not trust it.

7.1.4. Other model selection procedures by resampling

The most well-known resampling-based model selection procedure is
cross-validation. For practical reasons, it is often used in its V-fold version, which can
have some tricky behavior, in particular for choosing V [68, 9]. This can also
be seen in the simulation experiments of Section 5 (see Table 3): in HSd1,
V = 2 performs better than V ∈ {5, 10, 20}, a phenomenon explained in [9] by
analyzing how the bias of the V-fold criterion depends on V.

V-fold penalization, that is, RP with a V-fold subsampling scheme, was
proposed in [9], where it was shown to improve significantly the model selection
performance of VFCV. In this paper and in [8], RP with several exchangeable
resampling schemes (generalizing the V = n case) is proved to perform at
least as well as V-fold penalization and often better.

Several penalization procedures use the bootstrap for estimating the ideal
penalty [30, 25, 62]. As noticed in Remark 6, the penalization procedures studied
by Shibata [62] are quite close to RP, although they are restricted to bootstrap
weights, which are the worst ones in the framework of the present paper (see
Sections 4.1 and 6.2). Moreover, they do not consider it useful to multiply the
penalty by a factor C possibly different from one, contrary to what is suggested
in RP. The factor C is crucial because it disconnects the choice of the weights
from the overpenalization problem.

In order to select the correct model asymptotically with probability one,
Shao [59] proposed to use RP with the M_n out of n bootstrap and provided a
sufficient condition on M_n to achieve model consistency. Thanks to the unified
approach for all the exchangeable resampling weights provided in this paper,
Shao's condition can be rewritten as C = 1 ≫ Cw (see Remark 6), which
corresponds to the known fact that model consistency requires overpenalization
within a factor tending to infinity with n [..]. Hence, we conjecture that RP
with a constant C ≫ Cw is model consistent for most exchangeable W, which
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may improve Shao's penalties since Efr(M_n) weights are probably not the best
weights in terms of accuracy (see Section 4) and variability^^.

7.2. Resampling Penalization in the general prediction framework 

As mentioned in Section 2.1, Resampling Penalization is a general-purpose 
method which is definitely not restricted to the histogram selection problem. 
The purpose of this subsection is to define properly RP in the general predic- 
tion framework and to discuss briefly what differences can be expected compared 
to the histogram selection framework. 

7.2.1. Framework

Suppose we observe some data $(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathcal{X} \times \mathcal{Y}$, independent with
common distribution P. The goal is to predict Y given X, where $(X, Y) \sim P$ is
independent of the data. The quality of a predictor $t : \mathcal{X} \to \mathcal{Y}$ is measured by the
prediction loss $P\gamma(t) := \mathbb{E}_{(X,Y)}\left[ \gamma(t, (X, Y)) \right]$, where $(X, Y) \sim P$ and $\gamma$ is a given
contrast function. Typically, $\gamma(t, (x, y))$ measures the discrepancy between $t(x)$
and $y$. The excess loss is defined as $\ell(s, t) := P\gamma(t) - \inf_{t : \mathcal{X} \to \mathcal{Y}} P\gamma(t)$, even if
$s = \arg\min_t \{ P\gamma(t) \}$ is not well-defined. Classical examples are least-squares
regression, where $\mathcal{Y} = \mathbb{R}$ and $\gamma(t, (x, y)) = (t(x) - y)^2$, and binary supervised
classification, where $\mathcal{Y} = \{0, 1\}$ and $\gamma(t, (x, y)) = \mathbf{1}_{t(x) \neq y}$ is the 0-1 contrast.

A general prediction algorithm $\hat{s}$ is then defined as a function associating
a predictor to any data sample. In order to simplify the presentation,
algorithms are assumed to depend only on the empirical distribution
$P_n = n^{-1} \sum_{i=1}^n \delta_{(X_i, Y_i)}$ as an input^^. For instance, the empirical risk minimizer over
a set $S_m$ of predictors is defined as $\hat{s}_m(P_n) := \arg\min_{t \in S_m} P_n\gamma(t)$, provided
the minimum over $S_m$ exists and is unique.

Let us assume that a collection of algorithms $(\hat{s}_m)_{m \in \mathcal{M}_n}$ is given. The goal
is to select some data-dependent $\hat{m} \in \mathcal{M}_n$ minimizing the prediction loss
$P\gamma(\hat{s}_{\hat{m}}(P_n))$. The penalization method consists in selecting

\[ \hat{m} \in \arg\min_{m \in \mathcal{M}_n} \left\{ P_n\gamma\left( \hat{s}_m(P_n) \right) + \mathrm{pen}(m) \right\} , \]

where $\mathrm{pen} : \mathcal{M}_n \to \mathbb{R}$ is a penalty function, possibly data-dependent. Since the
goal is to minimize the prediction loss, the ideal penalty is

\[ \mathrm{pen}_{\mathrm{id}}(m) := (P - P_n)\gamma\left( \hat{s}_m(P_n) \right) = F_m(P, P_n) , \]

which cannot be used because it depends on the unknown distribution P. When
$\mathcal{M}_n$ is not too large (for instance, when $\mathrm{Card}(\mathcal{M}_n) \leq C n^{\alpha}$ for some positive
constants C, α), a natural strategy is to define pen(m) as an estimator of $\mathrm{pen}_{\mathrm{id}}(m)$
with a bias as small as possible.

^^Taking into account all the data for computing the resampling penalty with Efr(M_n)
weights is computationally costly when n/M_n is large.

^^Otherwise, we can consider algorithms whose input is any weighted sample. 
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7.2.2. Definition of Resampling Penalization 

As detailed in Section 2.2, the resampling heuristics can be used for estimating
$\mathbb{E}[\mathrm{pen}_{\mathrm{id}}(m)] = \mathbb{E}[F_m(P, P_n)]$, leading to the following procedure.

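Procedure 3 (Resampling Penalization).

1. Replace $\mathcal{M}_n$ by

\[ \widehat{\mathcal{M}}_n = \left\{ m \in \mathcal{M}_n \text{ s.t. } \hat{s}_m(P_n) \text{ is well-defined} \right\} . \]

2. Choose a resampling scheme, that is the distribution D(W) of a weight
vector W.

3. Choose a constant $C \geq C_W$.

4. Compute the following resampling penalty for each $m \in \widehat{\mathcal{M}}_n$:

\[ \mathrm{pen}(m) = C\, \mathbb{E}_W\left[ P_n\gamma\left( \hat{s}_m(P_n^W) \right) - P_n^W\gamma\left( \hat{s}_m(P_n^W) \right) \right] , \tag{28} \]

where $P_n^W := n^{-1} \sum_{i=1}^n W_i \delta_{(X_i, Y_i)}$.

5. Select $\hat{m} \in \arg\min_{m \in \widehat{\mathcal{M}}_n} \left\{ P_n\gamma\left( \hat{s}_m(P_n) \right) + \mathrm{pen}(m) \right\}$.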
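
The steps of Procedure 3 can be sketched end to end as follows. This is a hedged illustration, not a reference implementation: the function names are ours, the learners are weighted least-squares polynomials (a stand-in for any algorithm accepting a weighted sample), the Rad weights are assumed to take the values 0 and 2 with probability 1/2 each, the expectation over W in (28) is replaced by a Monte-Carlo average over B draws, and step 1 is skipped because these learners are always well-defined here.

```python
import numpy as np

def poly_fitter(deg):
    # Hypothetical weighted learner: least-squares polynomial of degree `deg`.
    def fit(x, y, w):
        coef = np.polyfit(x, y, deg, w=np.sqrt(w))   # polyfit squares its weights
        return lambda x_new: np.polyval(coef, x_new)
    return fit

def rad_weights(n, rng):
    # Assumed "Rad" scheme: W_i is 0 or 2 with probability 1/2 each.
    return 2.0 * rng.integers(0, 2, size=n)

def resampling_penalization(x, y, fitters, contrast, draw_weights, C=1.0, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    weights = [draw_weights(n, rng) for _ in range(B)]   # step 2, shared by all models
    crit = []
    for fit in fitters:
        emp_risk = contrast(fit(x, y, np.ones(n))(x), y).mean()
        terms = []
        for w in weights:                                # step 4, Monte-Carlo version of (28)
            loss = contrast(fit(x, y, w)(x), y)
            terms.append(loss.mean() - (w * loss).mean())
        crit.append(emp_risk + C * np.mean(terms))
    crit = np.array(crit)
    return int(np.argmin(crit)), crit                    # step 5

rng = np.random.default_rng(0)
x = rng.uniform(size=60)
y = np.sin(np.pi * x) + 0.3 * rng.normal(size=60)
squared = lambda pred, obs: (pred - obs) ** 2
fitters = [poly_fitter(d) for d in range(4)]
m_hat, crit = resampling_penalization(x, y, fitters, squared, rad_weights)
print(m_hat, crit)
```

Drawing the weight vectors once and reusing them for every model is a design choice of this sketch: it makes the penalized criteria of different models comparable on the same resamples.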

As for the histogram selection problem, two possible problems have to be
solved. First, $\hat{s}_m(P_n^W)$ may not be well-defined for a.e. W even if $m \in \widehat{\mathcal{M}}_n$.
A way to define properly the resampling penalty for every $m \in \mathcal{M}_n$ such that
$\hat{s}_m(P_n^W)$ is well-defined for every $W \in (0, +\infty)^n$ is suggested in [6, Section 8.1].
This assumption is satisfied by regressograms (hence, in the framework of the
rest of the paper), for which the suggestion of [6, Section 8.1] yields exactly the
penalty (7).

Second, the constant Cw such that (28) estimates unbiasedly pen_id(m) when
C = Cw is required in Procedure 3. For the histogram selection problem, the
explicit expression of Cw follows from Propositions 1 and 2. In general, the
asymptotic theory of exchangeable bootstrap empirical processes [(Hi, Theorem 3.6.13]
suggests that Cw = 1 if var(W_1) ≈ 1, which holds for the classical weights Efr,
Rad, Poi and Rho; nevertheless, asymptotic control on the bias is not sufficient
when the collection of algorithms is allowed to depend on the sample size n,
as in the histogram selection problem. Therefore, further theoretical
investigations would be useful to compute the theoretical value of Cw to be used in
Procedure 3. From the practical point of view, the data-driven calibration
algorithm of [11] can be used for choosing the constant C in front of the resampling
penalty.

7.2.3. Model selection properties of Resampling Penalization 

The theoretical validity of Procedure 3 is only proved for histogram model selec- 
tion in this paper, because precise non-asymptotic controls of the ideal penalty 
and its resampling counterpart are needed. To our knowledge, the only known 
result about model selection with Resampling Penalization was that RP with the 
classical bootstrap weights (Efr) is asymptotically optimal for selecting among 
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maximum likelihood estimators in [62], assuming that the distribution P belongs 
to some parametric family of densities. 

RP can be conjectured to enjoy adaptivity properties for a wide class of model
selection problems for two main reasons. First, RP relies on the resampling idea
which is known to be robust in a wide variety of frameworks; Theorems 1 and 2
have confirmed the robustness of RP to heteroscedasticity, whereas RP has not
been designed specifically for least-squares regression with heteroscedastic data.
Second, several of the key concentration inequalities used to prove Theorems 1
and 2 have been extended in [11, Propositions 8 and 10] to a general framework
including bounded regression and binary classification.

As mentioned at the end of Section 7.2.1, Procedure 3 should be restricted to
choosing among a number of algorithms at most polynomial in n. Indeed, when
$\mathrm{Card}(\mathcal{M}_n)$ is larger, estimating unbiasedly $\mathrm{pen}_{\mathrm{id}}$ can yield strong overfitting
[21]. Therefore, RP must be modified for large collections $\mathcal{M}_n$. We suggest
grouping algorithms according to some modelling complexity index $C_m$, such as the
dimension of $S_m$ if $\hat{s}_m$ is the empirical risk minimizer over some vector space
$S_m$; then, for every $C \in \mathcal{C}_n = \{ C_m \text{ s.t. } m \in \mathcal{M}_n \}$, define $\hat{s}_C := \hat{s}_{\hat{m}(C)}$ where
$\hat{m}(C) \in \arg\min_{m : C_m = C} P_n\gamma\left( \hat{s}_m(P_n) \right)$; finally, apply Procedure 3 to the collection
$(\hat{s}_C)_{C \in \mathcal{C}_n}$, assuming that $\mathrm{Card}(\mathcal{C}_n)$ is at most polynomial in n.
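
The grouping step can be sketched in a few lines. The function and argument names are hypothetical: `emp_risk[m]` stands for the empirical risk of the m-th algorithm and `complexity[m]` for its complexity index (for instance the dimension of the model).

```python
def group_by_complexity(emp_risk, complexity):
    # For each complexity value, keep the index of the algorithm with the
    # smallest empirical risk; Procedure 3 is then applied to the survivors.
    best = {}
    for m, (risk, c) in enumerate(zip(emp_risk, complexity)):
        if c not in best or risk < emp_risk[best[c]]:
            best[c] = m
    return best

# Five algorithms with three distinct complexity values.
representatives = group_by_complexity([0.9, 0.7, 0.8, 0.4, 0.5], [1, 2, 2, 3, 3])
print(representatives)
```

The reduced collection has one element per complexity value, so its size is controlled by the number of distinct values of the index rather than by the number of original algorithms.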

7.2.4. Related penalties for classification

In the classification framework, RP should be compared to several classical
resampling-based penalization methods. First, RP with Efr weights was
introduced by Efron [30] and called bootstrap penalization; its main drawback
is its bias (as for the histogram selection problem), which can be corrected in
several ways, using for instance the double bootstrap penalization or the .632
bootstrap [30]. Nevertheless, the computational cost of the double bootstrap is
heavy and the general validity of the .632 bootstrap is questionable because of
its poor theoretical grounds.

Second, the global Rademacher complexities were introduced in order to
obtain theoretically validated model selection procedures in classification [45, 17].
They are resampling estimates of

\[ \mathrm{pen}_{\mathrm{id}}^{\mathrm{glo}}(m) := \sup_{t \in S_m} \left\{ (P - P_n)\gamma(t) \right\} \geq (P - P_n)\gamma\left( \hat{s}_m(P_n) \right) = \mathrm{pen}_{\mathrm{id}}(m) , \]

with Rad weights; more recently, Fromont [34] generalized global Rademacher
complexities to a wide family of exchangeable resampling weights and obtained
non-asymptotic oracle inequalities. Nevertheless, global complexities (that is,
estimates of $\mathrm{pen}_{\mathrm{id}}^{\mathrm{glo}}$) are too large compared to $\mathrm{pen}_{\mathrm{id}}$, so that they cannot achieve
fast rates of estimation when the margin condition [53] holds.

Therefore, localized penalties taking into account the closeness between $\hat{s}_m(P_n)$
and s have been introduced, in particular local Rademacher complexities [50, 18,
19, 46]; these papers proved sufficiently tight oracle inequalities to ensure that
the final prediction loss can achieve fast rates. Nevertheless, local Rademacher
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complexities are computationally heavy and depend on several constants which 
are difficult to calibrate. 

RP aims at combining the advantages of these three approaches in classifi- 
cation. From the practical point of view, RP is computationally tractable (see 
Section 6.1) and reasonably easy to calibrate (see Section 6.3). Compared to 
global Rademacher complexities, resampling penalties estimate directly $\mathrm{pen}_{\mathrm{id}}$,
so that RP should be able to achieve fast rates of estimation when the margin 
condition holds. Finally, contrary to the bootstrap penalty, RP can be used with 
several resampling weights including i.i.d. Rademacher weights (Rad), so that 
the bias of RP may not have to be corrected. 

7.3. Conclusion

This article intends to help the practitioner answer the following question:
When should Resampling Penalization be used? To sum up, we list below the 
advantages and drawbacks of RP vs. the classical methods. 

Advantages of RP 

• generality: well-defined in almost any framework. 

• robustness and versatility: designed for the cautious user. 

• adaptivity to several properties, in particular heteroscedasticity and
smoothness of the target.

• flexibility: possibility of overpenalization, either for non-asymptotic predic- 
tion or for identification. 

Drawbacks of RP 

• computation time: one may prefer V-fold procedures such as V-fold
cross-validation or V-fold penalties [9].

• possibly outperformed by Mallows' Cp (for easy problems) or ad hoc proce- 
dures (in some particular frameworks, when some information on the data 
is available). 

8. Proofs 

8.1. Notation

Before starting the proofs, we introduce some additional notation and conven- 
tions: 

• The letter L denotes "some positive absolute constant, possibly different
from some place to another". In the same way, a positive constant which
depends on $c_1, \ldots, c_k$ is denoted by $L_{c_1,\ldots,c_k}$; if (A) denotes a set of
assumptions, $L_{(A)}$ denotes any positive constant depending on the parameters
appearing in (A).

• By convention, $\infty \mathbf{1}_E$ and $\mathbf{1}_E/0$ are both equal to zero when the event $E$
does not hold.

• For any $x \in \mathbb{R}$, $x_+ := x \vee 0 = \max(x, 0)$ and $x_- := (-x) \vee 0$.

• For any non-negative random variable $Z$, $e^0_Z := E[Z]\, E[Z^{-1}\mathbf{1}_{Z>0}]$.

• For any model $m \in \mathcal{M}_n$,

$$p_1(m) := P(\gamma(\hat s_m) - \gamma(s_m)), \qquad p_2(m) := P_n(\gamma(s_m) - \gamma(\hat s_m)), \qquad \bar\delta(m) := (P_n - P)(\gamma(s_m) - \gamma(s)).$$

• Histogram-specific notation: for any g > 0, m G Mm A G and any 
random variable Z, 



[Z] ■.= E[Z\ (U.e/Ji<,<„,,eA,„J ll^ll^"^ ^=1^^-" tl^l']'^' 

:= r - , := (E [\Y - s^{X)\'^ \ X G h]f'' 

Sx.i:= ^ {Y,-I3x) and 5^,2:= iY^ ~ ■ 

• Conventions for pi and p2 when is not well-defined (in the histogram 
framework) : 



pi(m) :=pi'"'(to) + ^ pa(cta)^1p^^o 



AeA,„ 

with pi^^-)- E ^^li 

and P2(m) :=p2(m) + i ^ [axf 1,-^^^ 

AeA„ 

Note that pi(m) = pi-°\m) = pi{m) and P2{m) = P2{m) are well-defined 
when Sm is uniquely defined, and other models are always removed from 
Ain- The above convention is only important when writing expectations, 
so it is merely technical. In the following, pi (resp. P2) will often be written 
simply pi (resp. p2)- 

Using the above notations, pi{m) and P2(jn) can now be computed explicitly 
for histogram models. For any m G Mn such that minAeA,„ Px > 0, 

pr(m)= >: pa(/3a-/3a) =- >: I (29) 



AeA,„ AeA„i \ ' 

E Fa (/3a- /3a)' = i E f- 
AeA„ AeA„, \ 



P2{m) = V Fa ( /3a - /3a ) = V I 1 - ,o5^ I (30) 



since /3a - /3a = S\.i/{npx). 
8.2. General framework

The main results (Theorems 1 and 2) actually are corollaries of a more general 
oracle inequality (Lemma 7). First, two different assumption sets under which 
Lemma 7 holds are stated in this subsection. The first one (Bg) deals with 
bounded data, the second one (Ug) with unbounded data. 



8.2.1. Bounded assumption set (Bg) 

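There is some noise: $\|\sigma(X)\|_2 > 0$.
(P1) Polynomial size of $\mathcal{M}_n$: $\mathrm{Card}(\mathcal{M}_n) \le c_{\mathcal{M}} n^{\alpha_{\mathcal{M}}}$.
(P2) Richness of $\mathcal{M}_n$: $\exists m_0 \in \mathcal{M}_n$ s.t. $D_{m_0} \in [\sqrt{n}; c_{\mathrm{rich}}\sqrt{n}]$.
(P3) The weight vector W is exchangeable, among Efr, Rad, Poi, Rho and Loo.
(P4) The constant C is well chosen: $\eta C_W \ge C \ge C_W$.
(Ab) Bounded data: $\|Y_i\|_\infty \le A < \infty$.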
(Am,ℓ) Local moment assumption: there exist $a_\ell, D_0 > 0$ such that for every
$q \ge 2$, for every $m \in \mathcal{M}_n$ such that $D_m \ge D_0$,



P,n{<l) 2 ^ ^^r- 

Z^AeAm "^2, A 

(Ap) Polynomially decreasing bias: there exist $\beta_1 \ge \beta_2 > 0$ and $C_b^+, C_b^- > 0$
such that, for every $m \in \mathcal{M}_n$,

$$C_b^- D_m^{-\beta_1} \le \ell(s, s_m) \le C_b^+ D_m^{-\beta_2}.$$

(Aq) There exist Cg > and Dq > Q such that for every m ^ Mn with 

" " AeA,„ 

(Ar$^X_\ell$) Lower regularity of the partitions for $\mathcal{D}(X)$: there exists $c^X_{r,\ell} > 0$ such
that for every $m \in \mathcal{M}_n$, $D_m \min_{\lambda \in \Lambda_m} p_\lambda \ge c^X_{r,\ell}$.



4 



8.2.2. Unbounded assumption set (Ug) 
(Ab) is replaced in (Bg) by

(Aσmax) Noise-level bounded from above: $\sigma^2(X) \le \sigma_{\max}^2 < \infty$ a.s.
(Asmax) Bound on the target function: $\|s\|_\infty \le A < \infty$.

(Agε) Global moment assumption for the noise: there exist $a_{g\varepsilon}, \xi_{g\varepsilon} > 0$ such
that for every $q \ge 2$,

$$P^{g\varepsilon}(q) := \|\varepsilon\|_q \le a_{g\varepsilon}\, q^{\xi_{g\varepsilon}}.$$

(Aδ) Global moment assumption for the bias: there exists $c^g_{\Delta,m} > 0$ such
that for every $m \in \mathcal{M}_n$ with $D_m \ge D_0$,

$$\|s - s_m\|_\infty \le c^g_{\Delta,m}\, \|s(X) - s_m(X)\|_2.$$

8.2.3. General result 

Lemma 7. Let $n \in \mathbb{N} \setminus \{0\}$, $\gamma_0 > 0$ and $\hat m$ be defined by Procedure 1. Assume
that either (Bg) or (Ug) holds with constants independent of $n$.

Then, there exists a constant $K_1$ (that depends on $\gamma_0$ and all the constants
in (Bg) (resp. (Ug)), but not on $n$) such that

$$\ell(s, \hat s_{\hat m}) \le \left[ 2\eta - 1 + (\ln(n))^{-1/5} \right] \inf_{m \in \mathcal{M}_n} \{ \ell(s, \hat s_m) \} \tag{31}$$

holds with probability at least $1 - K_1 n^{-\gamma_0}$.

Lemma 7 is proved in Section 8.7. 

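Remark 8. If the lower bound in (Ap) is removed from the assumption set, then
there exist constants $\gamma_1, \gamma_2 > 0$ (depending only on $\xi_\ell$, resp. on $\xi_\ell$ and $\xi_{g\varepsilon}$) and
an event of probability at least $1 - K_1 n^{-\gamma_0}$ on which

$$\ell(s, \hat s_{\hat m}) \le \left[ 2\eta - 1 + (\ln(n))^{-1/5} \right] \inf_{m \in \mathcal{M}_n,\ D_m \ge (\ln(n))^{\gamma_1}} \{ \ell(s, \hat s_m) \} + \frac{(\ln(n))^{\gamma_2}}{n}. \tag{32}$$

This assertion is proved in Section 8.7.3.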

Remark 9. In the infimum in (31), Sm may not be well-defined for some m S 
A4n- By convention £ (s, ) is defined as +00 for these to. 

From the proof of Lemma 7, there exists a constant c > (depending on 
c(My 7o and c^g) such that every model of dimension smaller than cn (ln(n) ) 
belongs to Mn on the event where (31) holds. For each of these models, 

i{s,Sm) = i{s,s,n) +p'i*°^(m) =£{s,s„i) +pi{m) 

so that the infimum can be restricted to models of dimension smaller than 
cn ( ln(n) ) ^ with any of these conventions for i {s, ) . 

The main results of the paper (Theorems 1 and 2) can now be proved, which 
is done in Sections 8.3-8.5. 

First, the assumptions of Theorem 1 imply (Bg). Second, the alternative 
assumption sets stated in Section 3.3.2 imply (Bg). Third, the assumptions of 
Theorem 2 imply (Bg) except the lower bound in (Ap), so that Remark 8 can 
be used instead of Lemma 7. 
8.3. Proof of Theorem 1

Lemma 7 is applied with $\gamma_0 = 2$. In order to deduce (8), it remains to show that
(Am,ℓ) and (Aq) are satisfied. Both hold with $D_0 = 1$ since for every $m \in \mathcal{M}_n$:



p;m ^ fei<i < 11^-7^)11^ < ^ (33) 



— mill' 



Let be the event on which (8) holds true. Then, 



< [2r]-l + e.n 



inf {e{s,s,n)} 



which proves (9). Following Remark 9, (9) also holds with A^„ replaced by 

{rn e Mn s.t. Dm < c(q: a^, c^'^,)??. (ln(n) } 
and the convention pi(m) = pi^^\m). □ 

8.4. Proof of Theorem 1: alternative assumptions 

In this section, the statements of Section 3.3.2 are proved. 

8.4.1. No uniform lower bound on the noise-level

When $\sigma_{\min} = 0$ in (An), Lemma 8 below proves that (Aq) also holds with
$D_0 = L_{(Bg)}$. Therefore, using (33), (Am,ℓ) holds with the same $D_0$. □

Lemma 8. Let X C M'', m e M n, and assume that positive constants ^, a^, Cr,ui ^o-j Ja 
exist such that 

(Ari) maxAeA„ {diam(JA)} < ^^-D""'' diam(X), 
(Aru) maxA6A„ {Leb(/A)} < Ci.,u£>~i and 
(Act) a is piecewise Ka-Lipschitz with at most jumps. 

Then, 

Leb(A') ||a||^.(Leb) Kl «j'diam(A')2 ||a(X)||^^ 

Wt?i. — 



Lemma 8 is proved in the technical appendix [8]. 
Remark 10. Since $\|\sigma(X)\|_2 > 0$ and $\sigma$ is piecewise Lipschitz, $\|\sigma\|_{L^2(\mathrm{Leb})} > 0$.
Thus, the lower bound on $Q_m^{(p)}$ is positive when $D_m$ is large enough.
8.4.2. Unbounded data



We still use Lemma 7, but the proof is a little longer and requires the following
Lemma 9, which is proved in the technical appendix [8].

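Lemma 9. Assume that $\mathcal{X} \subset \mathbb{R}$ is bounded and the following:

(Al) $\exists B, B_0, c_J > 0$ such that $s : \mathcal{X} \to \mathbb{R}$ is $B$-Lipschitz, piecewise $C^1$ and
non-constant (that is, $\pm s' \ge B_0$ on some interval $J \subset \mathcal{X}$ with $\mathrm{Leb}(J) \ge c_J$).
(Ar$_{\ell,u}$) Regularity of the partitions for Leb: $\exists c_{r,\ell}, c_{r,u} > 0$ such that

$$\forall m \in \mathcal{M}_n,\ \forall \lambda \in \Lambda_m, \quad c_{r,\ell} D_m^{-1} \le \mathrm{Leb}(I_\lambda) \le c_{r,u} D_m^{-1}.$$

(Adℓ) Density bounded from below: $\exists c_X^{\min} > 0$, $\forall I \subset \mathcal{X}$, $P(X \in I) \ge c_X^{\min}\, \mathrm{Leb}(I)$.
Then, (Aδ) holds true, that is, for every model $S_m$ of dimension $D_m \ge D_0$,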



with 



-"A.jri 



Cr.l 



Woo < CI.„A\-S{X) - Sm{X)\\ 



3/2 



BVM 



and Do := 4cr.uC T^. 



Pathwise oracle inequality. We prove that (8) holds with probability $1 - K_1 n^{-\gamma_0}$
for a general $\gamma_0$, since it will be required for proving a classical oracle
inequality below. First, (Am,ℓ), (Aq) and (Agε) hold since for every $m \in \mathcal{M}_n$



AgA„ 



(p) 



< 



Q 



)(p) 



— mill 



(9) — "^maxCgauss^/?- 



Second, Lemma 9 (with (Al), (Ar$_{\ell,u}$) and (Adℓ)) shows that (Aδ) holds
with $c^g_{\Delta,m} = L_{(Ug)}$ and $D_0 = L_{(Ug)}$.

Classical oracle inequality. Let $\Omega_n$ be the event on which (8) holds true
with $\gamma_0 = 6 + \alpha_{\mathcal{M}}$. As in the bounded case, it suffices to upper bound



by Cauchy-Schwarz 



2||»||», + 2pi(m)2 



\ 



E^ 



^ Pi(m)2l 



m£M„ 
For every $m \in \mathcal{M}_n$, a bound on $E^{\Lambda_m}[(p_1(m))^2]$ is required. Starting from (29),



04 
■^A.l 



< 



E 

A6A„, 



04 



-E(- 



2 

max 



{2Af 



P\ PA' 2 2 
— — "i2,A'^2,A' 
PA PA' 



Since 



E^' 



o4 



E^ 



(E;..g/,y.-/3A))' 



m\ ^ 6{npx - l)m: 



2, A 



and An E <A<(«^'Z^O (^max + (2A) 



"■PA nPA 
2\2 



AeA„ 



Hence, using that Card(A^„) < CMn"^ , 
which proves (9). 



□ 



8.5. Proof of Theorem 2 

In this proof, (H) denotes the set of assumptions made in Theorem 2. (H)
implies all the assumptions of Theorem 1 except maybe the lower bound in
(Ap); indeed, (Adℓ) and the fact that all the models are "regular" imply (Ar$^X_\ell$).
Therefore, we can start from (32) in Remark 8 below Lemma 7, which does not
require the lower bound in (Ap) to hold. The constants $\gamma_i$ are absolute because
the data are bounded.

2k k ~2k 

Let to(To) € Mn be the model of dimension closest to R'^^^ n^^+^ tj^!^'' ■ 
By definition of Tq and Mn, 

2-^R^n^<jS^ <To< 2R^n^aS^ . 

If n > i(H),cj 2^0 larger than (ln(n) )"'^ and smaller than cn (In(ri) ) ^ . Hence, 
from the proof of Lemma 7, m(To) € A^„ and m(To) has a finite excess loss on 
the large probability event of Lemma 7. Moreover, 



)<i{ 



S, S 



' *m(To) , 



Pi^°\m{To)) 



when n > i(H)- Since i [s, Sm(To) ) ^ R^Tq ^" and 



E 



p^^''\m{n)) 



< sup eg(„ ) 

."P>0 



E ((- 



AeA^ 



^a) ■ 



'(To) 
^ 2R^T^ ^" 2g-inax-Pm(To)



(the bound e^^^^^ < 2 coming from [-38, Lemma 4.1]), an event of probability 
at least 1 — A'Jn^^ exists on which 

\ m / 

where $K_2$ may only depend on $k$ and $\alpha$. Note that the constant $K_1$ has been
replaced by $K_1' \ge K_1$ so that the probability bound $1 - K_1' n^{-2}$ is nonpositive
when $n$ is too small. Enlarging $K_1'$ once more, the term $(\ln(n))^{\gamma_2} n^{-1}$ can be
dropped by adding 1 to the constant $K_2$. Then, taking expectations as in
the proof of Theorem 1, (10) holds.

When (Aσ) holds, $\sigma_{\max}$ can be replaced by $\|\sigma\|_{L^2(\mathrm{Leb})}$ in the definition of
$m(T_0)$. Then, for every $\lambda \in \Lambda_{m(T_0)}$ such that $\sigma$ does not jump on $I_\lambda$,



{<y\f < m^axa^ < ^ + ^ l^a^{t)Leh{dt)^ 



<{l + e-^) ^ + (1 + 9) / a'{t)Lch{dt) 

for every 6 > Q (since Leb(A') — 1). If cr jumps on Ix (and there exist at most 
Jc such A), max/^ cr^ < crj^iax- Hence, taking 9 = 7^^, 



E 



n 1 ^ — ' 



.(To) 



^ 2i?%l-^" ^ 2iJ^(To)lklli2(Leb) ^ ^(H) 

~ 71 n n 

and the end of the proof does not change. In this second case, (An) can also be 
removed because all the assumptions stated in the first part of Section 3.3.2 are 
satisfied. □ 



8.6. Additional probabilistic tools

Several probabilistic results are needed in addition to the ones of Section 3.4 for
proving Lemma 7. First, Proposition 10 below deals with concentration properties
of $p_1$ and $p_2$. Note that concentration inequalities for $p_2$ can be obtained
in a general framework [11, Proposition 10]. On the contrary, we do not know
any other non-asymptotic bound on the two-sided deviations of $p_1$.

Proposition 10. Let $\gamma > 0$ and $S_m$ be the model of histograms associated with
some partition $(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$. Assume that $\min_{\lambda \in \Lambda_m} \{n p_\lambda\} \ge B_n$ and that
positive constants $a_\ell, \xi_\ell$ exist such that (Am,ℓ) $\forall q \ge 2$, $P^\ell_m(q) \le a_\ell q^{\xi_\ell}$. Then, if
$B_n \ge 1$, an event of probability at least $1 - L n^{-\gamma}$ exists on which



pi(m) > E [pi{m)] - Lai,^t,j 
Pi{m) < E [pi (to) ] + La, ,7 



(ln(n) 



-LB„ 



ZD,, 



(ln(n)) 



E[p2(m)] 

:b2M] (34) 



\P2 (m) - E[p2 (m)] I < La„^,,^ ^ J±- E [p2 (m) ] . 
Moreover, if Bn > 0, an event of probability at least 1 — Ln ' exists on which 



pi{m) > 



(7+l)ln(n) 



(ln(n)) 



-LB„ 



E[p2im)]. 

(35) 



Proposition 10 is proved in [8]. Second, Lemmas 11 and 12 below provide
concentration inequalities for $\bar\delta(m)$, when the data are either bounded or
unbounded.

Lemma 11. Assume that $\|Y\|_\infty \le A < \infty$. Recall that for every $m \in \mathcal{M}_n$,
$\bar\delta(m) = (P_n - P)(\gamma(s_m) - \gamma(s))$. Then for every $x \ge 0$, an event of probability
at least $1 - 2e^{-x}$ exists on which

$$\forall \eta > 0, \quad |\bar\delta(m)| \le \eta\, \ell(s, s_m) + \left( \frac{4}{\eta} + \frac{8}{3} \right) \frac{A^2 x}{n}.$$



In particular. 



\5{m)\ < 



'(s,s„0 _ 20 EA™[p2( 



/On 



3 O^P) 



(36) 



(37) 



Proof of Lemma 11. (36) essentially relies on Bernstein's inequality and is proved
in detail in [11, Proposition 8]. Then, (37) follows from (36) with $\eta = D_m^{-1/2}$
and the definition of $Q_m^{(p)}$. □



Lemma 12. Assume that positive constants Og^, S,ge, (imax o,nd ^ exist such 
that 

(Ag,,) yq>2, P3-{q)<ag,q^^', 
(Aa,nax) ||'t(X)|(^ < 

{A5) \\s-Sm\\^<c%^MX)~Sm{X)\\^. 
Then, for every x>Q, an event of probability at least 1 ~ exists on which 



\5{m)\ < 



£is,Sm) + ^E[p2{m)] 



(38) 
Moreover, if (Agε) and (Aσmax) hold true, but (Aδ) is replaced by (Asmax)
$\|s\|_\infty \le A < \infty$, for every $x \ge 0$, an event of probability at least $1 - e^{-x}$ exists
on which

$$|\bar\delta(m)| \le L_{a_{g\varepsilon}, \xi_{g\varepsilon}, A, \sigma_{\max}}\, n^{-1/2} x^{\xi_{g\varepsilon} + 1/2}. \tag{39}$$

Lemma 12 is proved in Section 8.10. Third, Lemma 13 ensures that empirical
frequencies $n\hat p_\lambda$ are not too far from the expected ones $n p_\lambda$.

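Lemma 13. Let $(p_\lambda)_{\lambda \in \Lambda_m}$ be non-negative real numbers of sum 1, $(n\hat p_\lambda)_{\lambda \in \Lambda_m}$
be a multinomial vector of parameters $(n; (p_\lambda)_{\lambda \in \Lambda_m})$ and $\gamma > 0$. Assume that
$\mathrm{Card}(\Lambda_m) \le n$ and $\min_{\lambda \in \Lambda_m} \{n p_\lambda\} \ge B_n > 0$. Then, an event of probability at
least $1 - L n^{-\gamma}$ exists on which

$$\min_{\lambda \in \Lambda_m} \{ n\hat p_\lambda \} \ge \frac{\min_{\lambda \in \Lambda_m} \{ n p_\lambda \}}{2} - 2(\gamma + 1) \ln(n). \tag{40}$$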

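Proof of Lemma 13. First, for every $\lambda \in \Lambda_m$, Bernstein's inequality [Proposition 2.9]
applied to $n\hat p_\lambda$ shows that an event of probability at least $1 - 2n^{-(\gamma+1)}$
exists on which

$$n\hat p_\lambda \ge n p_\lambda - \sqrt{2 n p_\lambda (\gamma + 1) \ln(n)} - \frac{(\gamma + 1) \ln(n)}{3}.$$

Since $\sqrt{2 n p_\lambda (\gamma + 1) \ln(n)} \le (n p_\lambda)/2 + (\gamma + 1) \ln(n)$, (40) holds on an event of
probability at least $1 - 2\, \mathrm{Card}(\Lambda_m)\, n^{-(\gamma+1)} \ge 1 - 2n^{-\gamma}$. □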
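
As an illustration only (not part of the original proof), the following Python sketch checks two ingredients of Lemma 13 numerically: the elementary inequality $\sqrt{2ab} \le a/2 + b$ used in the last step, and the bound (40) on simulated multinomial counts. The function names, the sample size, the number of cells and the seed are all hypothetical choices made for this example.

```python
import math
import random
from collections import Counter

def lemma13_threshold(n, probs, gamma):
    """Right-hand side of (40): min(n * p_lambda) / 2 - 2 * (gamma + 1) * ln(n)."""
    return min(n * p for p in probs) / 2 - 2 * (gamma + 1) * math.log(n)

def count_violations(n, probs, gamma, n_rep=200, seed=0):
    """Count simulated multinomial samples whose smallest cell count is below the threshold."""
    rng = random.Random(seed)
    cells = range(len(probs))
    threshold = lemma13_threshold(n, probs, gamma)
    violations = 0
    for _ in range(n_rep):
        counts = Counter(rng.choices(cells, weights=probs, k=n))
        if min(counts.get(c, 0) for c in cells) < threshold:
            violations += 1
    return violations

def amgm_gap(a, b):
    """(a/2 + b) - sqrt(2ab), which equals (sqrt(a/2) - sqrt(b))**2 and is thus >= 0."""
    return a / 2 + b - math.sqrt(2 * a * b)
```

With ten equiprobable cells and n = 1000, the threshold in (40) is far below the typical cell count, so violations are not expected in a simulation of this size; the bound is designed to hold uniformly, not to be tight.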

Finally, Lemmas 14 and 15 below are useful to compare the expectations of 
Pi and p2 on the one hand, and the expectations of pen and pen^ for possibly 
large models on the other hand. 

Lemma 14 (Lemma 7 of [!)]). //miuAeA^ {np\} > B >1, 
{l-e-^)E[p2{m)]<E\pi^°\m)] <E[pi(m)]< ( 1+ sup (5„,p ) E [p2(m) ] 

L J V np>B / 

where 6n,p is the same as in (15). A similar result holds with p2 instead of p2 
inside the expectation. 

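Lemma 15. Assume that W is a weight vector among Efr, Rad, Poi, Rho and
Loo. Let $S_m$ be the model of histograms associated with the partition $(I_\lambda)_{\lambda \in \Lambda_m}$,
$p_2(m) = P_n(\gamma(s_m) - \gamma(\hat s_m))$ and pen(m) be defined by (7) with $C = C_W$ (see
Table 2). Then, if $\min_{\lambda \in \Lambda_m} \{n\hat p_\lambda\} \ge 3$,

$$E^{\Lambda_m}[\mathrm{pen}(m)] \ge \frac{5}{4}\, E^{\Lambda_m}[p_2(m)]. \tag{41}$$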

//miuAgAm {tt-Px} ^ T for some positive T, (41) still holds for weight vectors 
among: 

• Efr(Mn) when A/„n-i > -P-^ ln(3/4 - 2/T) 

• Rad(p) when P >p-^ ln[8/ {3(1- p))] 

• Poi(fi) when T > 3 and fj,P > 1.61 

• Rho(q„) w/ien T > ng~^ ln[(4n)/(3(n — g„))]. 

Lemma 15 is proved in Section 8.9. 
8.7. Proof of Lemma 7

We first give the complete proof in the bounded case. Then, we will explain how 
it can be extended to the unbounded case. 

8.7.1. Bounded case 

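For every $m \in \mathcal{M}_n$, define

$$\mathrm{pen}'_{\mathrm{id}}(m) := p_1(m) + p_2(m) - \bar\delta(m) = \mathrm{pen}_{\mathrm{id}}(m) - (P - P_n)\gamma(s).$$

By definition of $\mathrm{pen}'_{\mathrm{id}}$ and $\hat m$, for every $m \in \widehat{\mathcal{M}}_n$,

$$\ell(s, \hat s_{\hat m}) - (\mathrm{pen}'_{\mathrm{id}}(\hat m) - \mathrm{pen}(\hat m)) \le \ell(s, \hat s_m) + (\mathrm{pen}(m) - \mathrm{pen}'_{\mathrm{id}}(m)). \tag{42}$$

The proof of Lemma 7 is divided into three main parts: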

1. With a large probability, $\mathrm{pen} - \mathrm{pen}'_{\mathrm{id}}$ is negligible in front of $\ell(s, \hat s_m)$
uniformly over models $S_m$ of "intermediate" dimension, that is $(\ln(n))^{\gamma_1} \le
D_m \le cn(\ln(n))^{-1}$ for some constants $c, \gamma_1 > 0$. This relies on the
concentration inequalities and comparisons of expectations stated in Sections 3.4
and 8.6.

2. The model $\hat m$ selected by Resampling Penalization has an "intermediate"
dimension. In order to prove this, a lower bound on

$$\mathrm{crit}'(m) := P_n\gamma(\hat s_m) + \mathrm{pen}(m) - P_n\gamma(s)$$

is proved for large and small models, and this bound is shown to be larger
than $\mathrm{crit}'(m_0)$, where $S_{m_0}$ is the model of intermediate dimension belonging
to the collection $(S_m)_{m \in \mathcal{M}_n}$ according to assumption (P2). Lemma 15
is crucial at this point.

3. The oracle model (that is, the one minimizing $\ell(s, \hat s_m)$) is also of
"intermediate" dimension, which is proven similarly to point 2 with $\mathrm{crit}'(m)$
replaced by $\ell(s, \hat s_m)$.

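For every $m \in \mathcal{M}_n$, define

$$A_n(m) := \min_{\lambda \in \Lambda_m} \{n\hat p_\lambda\} \quad \text{and} \quad B_n(m) := \min_{\lambda \in \Lambda_m} \{n p_\lambda\}.$$

Let $\Omega_{n,\gamma_0}$ be the event on which the concentration inequalities of Propositions 3
and 10 and Lemmas 11 and 13 hold for every $m \in \mathcal{M}_n$ with $\gamma = \alpha_{\mathcal{M}} + \gamma_0$ (or
similarly $x = (\alpha_{\mathcal{M}} + \gamma_0) \ln(n)$ in Lemma 11). Using assumption (P1), the union
bound gives $P(\Omega_{n,\gamma_0}) \ge 1 - L c_{\mathcal{M}} n^{-\gamma_0}$.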

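1. pen is close to $\mathrm{pen}'_{\mathrm{id}}$ for intermediate models. Let $c, \gamma_1 > 0$ be two
constants to be chosen later, and consider $\widetilde{\mathcal{M}}_n$ the set of $m \in \mathcal{M}_n$ such that
$(\ln(n))^{\gamma_1} \le D_m \le cn(\ln(n))^{-1}$. According to (Ar$^X_\ell$), for every $m \in \widetilde{\mathcal{M}}_n$,
$B_n(m) \ge c^X_{r,\ell}\, c^{-1} \ln(n)$ so that (40) ensures that $A_n(m) \ge \ln(n)$ on $\Omega_{n,\gamma_0}$ if
$c \le L_{c^X_{r,\ell}, \alpha_{\mathcal{M}}, \gamma_0}$. In particular, $\widetilde{\mathcal{M}}_n \subset \widehat{\mathcal{M}}_n$ on $\Omega_{n,\gamma_0}$.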

Assume also that n > exp(£'o), so that Dm > Dq for every rn e A^,i if 
7i > 1. Now, using both bounds on 

maxllpKm) - E [^^[(m)]! , \p2{m) - E [p2{m)]\, 
|(5(m)| , |pen(m) — E'^'" [pen(m)]|} 

is smaller than i(Bg) (ln(n) (s, s™ ) + E [p2(™) ] ) on ^n,-io provided that 

c < L^x ,^ (to ensure that Bn{m) is large enough) and 71 > 2^^ + 6. Fix now 
c = L^x^ .^ > and 71 = L^^ satisfying these conditions. Using Proposition 2, 

Lemma 14 and the lower bound on _B„(m), we have for every m G A^„ 



^(Bg) 



(ln(n)) 



jj^e{s,s„i) < (pen-pen|d)(m) 



< 



2(^-1) 



L 



(Bg) 



(ln(n)) 



1/4 



i{s,Sm) . 



as soon as n > i(Bg) (this restriction is necessary because the bounds are in 
terms oi £ {s,'sm) instead of £ ( s, Sm ) + E [p2 ])• Combined with (42). this gives: 
if n > L(Bg) 



277-1 



L 



(Bg) 



(ln(n)) 



1/4 



inf {e{s,Srn)}. (43) 



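2. $\hat m$ has an "intermediate" dimension. The penalized empirical criterion
$\mathrm{crit}(m) = P_n\gamma(\hat s_m) + \mathrm{pen}(m)$ has the same minimizers as

$$\mathrm{crit}'(m) = \ell(s, \hat s_m) + \mathrm{pen}(m) - \mathrm{pen}'_{\mathrm{id}}(m) = \ell(s, s_m) + \mathrm{pen}(m) - p_2(m) + \bar\delta(m)$$

over $\widehat{\mathcal{M}}_n$.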

According to (P2), there exists mg G A4n such that < D,no < CjichV^- 
If n > -/j(Bg)i "iQ G A^)i so that (using (Ap) and the same inequalities as in the 
first part of the proof) 



crit"(TOo) < £{s,Sjno) + 13(^0)1 + pcn(mo) < i(Bg) [n ^^^^ + 



-'/^y (44) 



Therefore, it remains to provide lower bounds on crit"(TO) for m ^ Mn- 
On the one hand, on ^nrta if < ( ln(77,) , 



Crit"(TO) > £ {s, Sm) — [(^("l)! — P2(™) 



>C^ (Inin))-'''^' -La, 



70 



In(ri) ( ln(n) ) 



l+Cf+7l 



^(Bg)- 



(45) 



On the other hand, if Dm > cn (ln(n)) ^ and m G A^n, by Lemma 15, 
E^- [pen(m) - P2{m)] > E^" [p2{m)] /4:. Therefore, we have pen(m)— p2 (jn) > 
(1 — i(Bg)?^^"^''"')IE [P2(™)] on rin.joj SO that 



crit"(m) > pen(m) — P2("^) — |'^(w)| > i(Bg) (lii('^)) 



(46) 
when $n \ge L_{(Bg)}$. Comparing (44), (45) and (46), it follows that any minimizer
$\hat m$ of crit over $\widehat{\mathcal{M}}_n$ belongs to $\widetilde{\mathcal{M}}_n$ on $\Omega_{n,\gamma_0}$, provided that $n \ge L_{(Bg)}$.

3. the oracle has an "intermediate" dimension It remains to prove that 
the infimum ean be extended to Ain on the right-hand side of (43), with the 
convention i{s,'sm) = +oo if A„{m) = 0. Using similar arguments as above 
(as well as the definition of i^n.-yoi particular (35) for large models), we have 
£{s,s„io) < -^(Bg) (?^~*^^ + n~^/^) on ^,u-io- Moreover, for every m ^ Mn, 
either An < ( hi(n) and i {s,s,n) > i {s, s,„ ) > L(Bg) ( ln(7i) y^'^' or > 
cn(ln(n))~^ and i{s,Sm) > Piim) > i(Bg) (ln(n))~^ on iln,'yo by (35) as soon 
as n > -£'(Bg)- Hence, if n > i(Bg), m ^ cannot contribute to the infimum 
in the right-hand side of (43). This concludes the proof of (31) in the bounded 
case. □ 



8.7.2. Unbounded case 

The proof of the bounded case has to be slightly modified. In the definition
of $\Omega_{n,\gamma_0}$, the concentration inequalities of Lemma 11 are replaced by those of
Lemma 12. Then, $\gamma_1$ has to be chosen such that $\gamma_1 \ge 2\xi_{g\varepsilon} + 3$. The rest of the
proof of (43) is unchanged.

In order to prove that $\hat m \in \widetilde{\mathcal{M}}_n$, (45) has to be slightly changed because of
the use of (39) instead of (36) to bound $\bar\delta(m)$. The final part of the proof is
then modified similarly. □



8.7.3. Proof of Remark 8 

We now prove the assertion made in Remark 8 below Lemma 7. Starting from 
(43), we can prove in the same way that < cn (In(ri) )~^, but _D^^ < 
(ln(?T,) )'''^ cannot be excluded. 

Let m G A4„ such that A„ < (ln(n) . Assume first that 

^(s,s™)> inf {t{s,s^)}+ ^ ^ (47) 

l-(ln(ri)) mex„ (l-(ln(n)) )n 

where e„ < -/^(Bg) (lii('^)) comes from (8). Then, on rin, 701 using (36) with 
?7 = (ln(n))"^ and (47), 

crit"(TO) > I {s, Sm ) ^ \6{ra)\ — P2{m) 

(ln(n))«^+^^+^(ln(n)-L(Bg)) 



>(27^-l+e„) inf {£(s, ?,„)} + 



n 



>(2,7-l + e„) inf (s, ) } + ^^^^ , (48) 
provided that n > -Zj(Bg). In addition, let mo € argmin^^,^^ {i {s,'Sm')}-
Since mp g on i^n,-yoj 

crit"(mo) = i{s,s„io ) + pcn(mo) - penJd(mo) < (27? - 1 + £„)£(s,Sm(, ) , 

and this upper bound is smaller than the lower bound in (48). 

Hence, on i^n,jo, if < (ln(n))^^ (47) cannot be satisfied with ?7i = fh. 
Moreover, by (34), for every tti € Mn such that < cn (ln(n) 

Pi(m)<L(Bg)(ln(n))«^+2:^ 

n 

on fin, 70 . Therefore, 

2^-l+£^ • f /.r - u^r (ln(n))^^+"^+^ 

n ^ (s, ) } + i(Bg) 

l-(ln(ri)) meM„ 

< (277-1 + (ln(n))-^/'') inf (s,s„,) } + ^^^^^^^^^^^ (49) 

assuming that n > L^^^y 

When > (ln(n))^\ (31) holds on Vt^^-yo which implies (49). Hence, (49) 

holds on ^n,-yo- 

Finally, with the same arguments as in Section 8.7.1, the infimum on the 
right-hand side of (49) can be extended to the set of m G 7Vi„ such that Dm > 
(ln(n))'''\ with the convention £{s,'Sm) = +oo if An{m) = 0. Enlarging the 
constant Ki to remove the condition tt, > i(Bg)i (32) is proved to hold with 
72 = 7i + + 3. The proof is quite similar in the unbounded case. □ 



8.8. Expectations

Proof of Proposition 1. On the one hand, (11) and (15) are consequences of (29) 
and (30); note that (15) holds whatever the convention taken for pi and p2 in 
Section 8.1. 

On the other hand, (12) follows from Lemma 16 below, which is slightly more
general since W is allowed to depend on $(\mathbf{1}_{X_i \in I_\lambda})_{1 \le i \le n,\ \lambda \in \Lambda_m}$.

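Lemma 16. Let $S_m$ be the model of histograms adapted to some partition
$(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$, $W \in [0; \infty)^n$ be a random vector such that for every $\lambda \in \Lambda_m$,
$(W_i)_{X_i \in I_\lambda}$ is exchangeable and independent of $(X_i, Y_i)_{X_i \in I_\lambda}$. Let pen(m) be
defined by (7) and assume $\min_{\lambda \in \Lambda_m} \{n\hat p_\lambda\} \ge 1$. Then,

$$\mathrm{pen}(m) = \frac{C}{n} \sum_{\lambda \in \Lambda_m} \left( R_{1,W}(n, \hat p_\lambda) + R_{2,W}(n, \hat p_\lambda) \right) \frac{n\hat p_\lambda S_{\lambda,2} - S_{\lambda,1}^2}{n\hat p_\lambda (n\hat p_\lambda - 1)}\, \mathbf{1}_{n\hat p_\lambda \ge 2} \tag{50}$$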
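
The following Python sketch shows one way formula (50) could be evaluated on data; it is an illustration under stated assumptions, not the author's implementation. It assumes that $S_{\lambda,1} = \sum_{X_i \in I_\lambda} (Y_i - \beta_\lambda)$ and $S_{\lambda,2} = \sum_{X_i \in I_\lambda} (Y_i - \beta_\lambda)^2$, in which case $(n\hat p_\lambda S_{\lambda,2} - S_{\lambda,1}^2) / (n\hat p_\lambda (n\hat p_\lambda - 1))$ does not depend on $\beta_\lambda$ and equals the unbiased sample variance of the responses in the cell. The names `resampling_penalty`, `centered_form`, `bin_ids` and the callable `r_sum` (returning $R_{1,W} + R_{2,W}$ for a given $n$ and cell count) are hypothetical.

```python
from collections import defaultdict
from statistics import variance

def centered_form(values, beta):
    """(k * S2 - S1**2) / (k * (k - 1)) with S1, S2 centered at an arbitrary beta."""
    k = len(values)
    s1 = sum(v - beta for v in values)
    s2 = sum((v - beta) ** 2 for v in values)
    return (k * s2 - s1 ** 2) / (k * (k - 1))

def resampling_penalty(bin_ids, y, C, r_sum):
    """Evaluate (50): (C / n) * sum over cells with >= 2 points of r_sum * sample variance.

    bin_ids[i] is the index of the cell containing X_i (assumed input format);
    r_sum(n, k) is a user-supplied stand-in for R_1 + R_2 in a cell with k points.
    """
    n = len(y)
    cells = defaultdict(list)
    for b, yi in zip(bin_ids, y):
        cells[b].append(yi)
    total = 0.0
    for values in cells.values():
        k = len(values)
        if k >= 2:  # indicator 1_{n p_hat >= 2}
            total += r_sum(n, k) * variance(values)
    return C * total / n
```

For instance, with cells `[0, 0, 1, 1, 1]`, responses `[1, 3, 2, 2, 5]`, `C = 1` and `r_sum` identically 1, the two cell variances are 2 and 3, so the function returns (2 + 3) / 5 = 1.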
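where $R_{1,W}$ and $R_{2,W}$ are defined by (13) and (14), that is

$$R_{1,W}(n, \hat p_\lambda) := E\left[ \frac{(W_i - W_\lambda)^2}{W_\lambda^2} \,\Big|\, X_i \in I_\lambda,\ W_\lambda > 0 \right] \quad \text{and} \quad R_{2,W}(n, \hat p_\lambda) := E\left[ \frac{(W_i - W_\lambda)^2}{W_\lambda} \,\Big|\, X_i \in I_\lambda \right].$$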



Proof of Lemma 16. First, as j)en^^{m) was split into pi{m) and p2{m) (plus a 
centered term), the resampling penalty (without the constant C) is split into 
two terms: 



P2(m)= ^ Evi. (^f -^a) 



Wx>0 



AeA„ 



(51) 
(52) 



A key quantity to compute is the following: for every A € A™ and Wx > 0, 

, 2 



E 



- ^a) 



Wx 



Wx 



n^px 



W 
Wx 



Wx 



(53) 



E {Y^ - pxKYj - ^x)^v 



n?px 



1 - 



w 

Wx 



1-^ 
Wx 



Wx 



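Since the weights are exchangeable, $(W_i)_{X_i \in I_\lambda}$ is also exchangeable conditionally
on $W_\lambda$ and $(X_i)_{1 \le i \le n}$. Hence, the "variance" term

$$R_V(n, n\hat p_\lambda, W_\lambda, \mathcal{D}(W)) := E_W\left[ (W_i - W_\lambda)^2 \mid W_\lambda \right]$$

does not depend on $i$ (provided that $X_i \in I_\lambda$) and the "covariance" term

$$R_C(n, n\hat p_\lambda, W_\lambda, \mathcal{D}(W)) := E_W\left[ (W_i - W_\lambda)(W_j - W_\lambda) \mid W_\lambda \right]$$

does not depend on $(i, j)$ (provided that $i \ne j$ and $X_i, X_j \in I_\lambda$). Moreover,

$$0 = E_W\left[ \Big( \sum_{X_i \in I_\lambda} (W_i - W_\lambda) \Big)^2 \,\Big|\, W_\lambda \right] = n\hat p_\lambda R_V(n, n\hat p_\lambda, W_\lambda, \mathcal{D}(W)) + n\hat p_\lambda (n\hat p_\lambda - 1) R_C(n, n\hat p_\lambda, W_\lambda, \mathcal{D}(W))$$

so that if $n\hat p_\lambda \ge 2$,

$$R_C(n, n\hat p_\lambda, W_\lambda, \mathcal{D}(W)) = \frac{-1}{n\hat p_\lambda - 1} R_V(n, n\hat p_\lambda, W_\lambda, \mathcal{D}(W)) \tag{54}$$

and $R_V(n, 1, W_\lambda, \mathcal{D}(W)) = 0$. Then, (53) and (54) imply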

Rvin,npx,Wx,V(W)), 



w 



Wx 



Wxri^Px 
npx 



npx>2 



-s. 



1 



A, 2 



npx-l npx 
Finally, (50) follows from the combination of (51) and (52) with (55). 



(55) 



□ 



8.9. Resampling constants 

Some results relative to the exchangeable weights introduced in Section 2.2 are
proved in this subsection. First, Lemma 17 below provides explicit formulas for
$R_{1,W}(n, \hat p_\lambda)$ and $R_{2,W}(n, \hat p_\lambda)$, which appear in the explicit formula (50) for the
resampling penalty.

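Lemma 17. Let $n \in \mathbb{N}$ and $\hat p_\lambda \in (0, 1]$ such that $n\hat p_\lambda \in \{1, \ldots, n\}$. Then, for
every $M \in \mathbb{N} \setminus \{0\}$, $p \in (0; 1]$, $\mu > 0$ and $q \in \{1, \ldots, n\}$,

$$R_{1,\mathrm{Efr}(M)} = \frac{n}{M}\, e^+_{\mathcal{B}(M, \hat p_\lambda)} \left( 1 - \frac{1}{n\hat p_\lambda} \right), \qquad R_{2,\mathrm{Efr}(M)} = \frac{n}{M} \left( 1 - \frac{1}{n\hat p_\lambda} \right) \tag{56}$$

$$R_{1,\mathrm{Rad}(p)} = \frac{1}{p}\, e^+_{\mathcal{B}(n\hat p_\lambda, p)} - 1, \qquad R_{2,\mathrm{Rad}(p)} = \frac{1}{p} - 1 \tag{57}$$

$$R_{1,\mathrm{Poi}(\mu)} = \frac{1}{\mu}\, e^+_{\mathcal{P}(n\hat p_\lambda \mu)} \left( 1 - \frac{1}{n\hat p_\lambda} \right), \qquad R_{2,\mathrm{Poi}(\mu)} = \frac{1}{\mu} \left( 1 - \frac{1}{n\hat p_\lambda} \right) \tag{58}$$

$$R_{1,\mathrm{Rho}(q)} = \frac{n}{q}\, e^+_{\mathcal{H}(n, n\hat p_\lambda, q)} - 1, \qquad R_{2,\mathrm{Rho}(q)} = \frac{n}{q} - 1 \tag{59}$$

$$R_{1,\mathrm{Loo}} = \frac{n\hat p_\lambda}{n (n\hat p_\lambda - 1)}\, \mathbf{1}_{n\hat p_\lambda \ge 2}, \qquad R_{2,\mathrm{Loo}} = \frac{1}{n - 1}$$

where $\mathcal{B}$, $\mathcal{P}$ and $\mathcal{H}$ denote respectively the Binomial, Poisson and Hypergeometric
distributions and $e^+_\mu = E[Z]\, E[Z^{-1} \mid Z > 0]$ with $Z \sim \mu$.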
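
To make these formulas concrete, here is a small Python sketch (an illustration, not taken from the paper) that computes $e^+_\mu$ exactly for finitely supported laws and evaluates the Rad and Rho constants as written in (57) and (59). It also checks that the Rho formula with $q = n - 1$ agrees with the closed form for Loo when $n\hat p_\lambda \ge 2$. The function names and the dictionary representation of a probability mass function are assumptions of this example; `n_lam` stands for $n\hat p_\lambda$.

```python
from math import comb

def e_plus(pmf):
    """e^+ = E[Z] * E[1/Z | Z > 0] for a law given as {value: probability}."""
    mean = sum(z * w for z, w in pmf.items())
    p_pos = sum(w for z, w in pmf.items() if z > 0)
    inv_mean = sum(w / z for z, w in pmf.items() if z > 0) / p_pos
    return mean * inv_mean

def binomial_pmf(n, p):
    """Probability mass function of Binomial(n, p)."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

def hypergeometric_pmf(n, r, q):
    """Number of marked items among q drawn without replacement from n items, r marked."""
    lo, hi = max(0, q - (n - r)), min(r, q)
    return {k: comb(r, k) * comb(n - r, q - k) / comb(n, q) for k in range(lo, hi + 1)}

def r1_rad(n_lam, p):
    """R_1 for Rad(p) weights, as in (57)."""
    return e_plus(binomial_pmf(n_lam, p)) / p - 1

def r1_rho(n, n_lam, q):
    """R_1 for Rho(q) weights, as in (59)."""
    return (n / q) * e_plus(hypergeometric_pmf(n, n_lam, q)) - 1

def r1_loo(n, n_lam):
    """Closed form of R_1 for leave-one-out weights, valid when n_lam >= 2."""
    return n_lam / (n * (n_lam - 1))
```

For example, with n = 20 and a cell containing 5 points, `r1_rho(20, 5, 19)` and `r1_loo(20, 5)` both evaluate to 0.0625 up to floating-point error.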

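Proof of Lemma 17. Since W is independent of the data, the observations with
$X_i \in I_\lambda$ can be assumed to be the $n\hat p_\lambda$ first ones: $(X_1, Y_1), \ldots, (X_{n\hat p_\lambda}, Y_{n\hat p_\lambda})$. The
random vector $(W_i)_{1 \le i \le n\hat p_\lambda}$ is then exchangeable (since W is exchangeable).
Hence, by definition of $W_\lambda = (n\hat p_\lambda)^{-1} \sum_{i=1}^{n\hat p_\lambda} W_i$,

$$\forall i \in \{1, \ldots, n\hat p_\lambda\}, \quad E_W[W_i \mid W_\lambda] = W_\lambda. \tag{60}$$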
Then, the quantity

$$R_V(n, n\hat p_\lambda, W_\lambda, \mathcal{D}(W)) = R_V(W_\lambda) = E\left[ (W_i - W_\lambda)^2 \mid W_\lambda \right]$$

appearing both in $R_{1,W}$ and $R_{2,W}$ is the variance of the weight $W_i$ conditionally
on $W_\lambda$.

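Exchangeable subsampling weights. A subsampling weight is defined as
any resampling weight W such that $W_i \in \{0, \kappa\}$ a.s. for every $i$. Such weights
can be written $W_i = \kappa \mathbf{1}_{i \in I}$ for some random $I \subset \{1, \ldots, n\}$. Rad and Rho
are the two main examples of such weights and they are both exchangeable.
This kind of weight is called "bootstrap without replacement weights" in [66,
Example 3.6.14]. First, when W is an exchangeable subsampling weight, (60)
implies

$$E_W\left[ W_i^2 \mid W_\lambda \right] = \kappa\, E_W\left[ W_i \mid W_\lambda \right] = \kappa W_\lambda$$

so that

$$\mathcal{D}(W_i \mid W_\lambda) = \kappa\, \mathcal{B}(\kappa^{-1} W_\lambda) \quad \text{and} \quad R_V(W_\lambda) = W_\lambda (\kappa - W_\lambda).$$

Then, this result is applied to Rad with $\kappa = p^{-1}$ and $\mathcal{D}(W_\lambda) = (n\hat p_\lambda p)^{-1} \times
\mathcal{B}(n\hat p_\lambda, p)$, which proves (57). In the Rho case, $\kappa = n/q$ and $\mathcal{D}(W_\lambda) = (q\hat p_\lambda)^{-1} \times
\mathcal{H}(n, n\hat p_\lambda, q)$ so that (59) follows. The Loo is a particular case of Rho (with
$q = n - 1$) and $e^+_{\mathcal{H}(n, n\hat p_\lambda, n-1)}$ can be computed with (22) in Lemma 5.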

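Efron. Efron weights can also be written

$$W_i = \frac{n}{M}\, \mathrm{Card}\{ 1 \le j \le M \text{ s.t. } U_j = i \} \tag{61}$$

with $(U_j)_{1 \le j \le M}$ a sequence of independent random variables with uniform
distribution over $\{1, \ldots, n\}$. Therefore,

$$\mathcal{D}(W_\lambda) = (M\hat p_\lambda)^{-1} \mathcal{B}(M, \hat p_\lambda) \quad \text{and} \quad \mathcal{D}(W_i \mid W_\lambda) = \frac{n}{M}\, \mathcal{B}\left( M\hat p_\lambda W_\lambda, \frac{1}{n\hat p_\lambda} \right)$$

so that

$$R_V(W_\lambda) = \frac{n}{M} W_\lambda \left( 1 - \frac{1}{n\hat p_\lambda} \right)$$

and (56) follows.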
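
The conditional variance computed in the Efron case can be checked by exact arithmetic. The Python sketch below is only an illustration with hypothetical function names: it computes the variance of $(n/M)$ times a Binomial$(N, 1/(n\hat p_\lambda))$ variable directly from its probability mass function, where $N = M \hat p_\lambda W_\lambda$ is the number of draws falling in the cell, and compares it with the closed form $(n/M)\, W_\lambda (1 - 1/(n\hat p_\lambda))$. Here `n_lam` stands for $n\hat p_\lambda$.

```python
from fractions import Fraction
from math import comb

def scaled_binomial_variance(scale, trials, prob):
    """Exact variance of scale * Binomial(trials, prob), computed from the pmf."""
    pmf = [comb(trials, k) * prob**k * (1 - prob) ** (trials - k) for k in range(trials + 1)]
    mean = sum(scale * k * w for k, w in enumerate(pmf))
    return sum((scale * k - mean) ** 2 * w for k, w in enumerate(pmf))

def efron_conditional_variance(n, M, n_lam, draws_in_cell):
    """Variance of W_i given that draws_in_cell of the M bootstrap draws fall in the cell."""
    return scaled_binomial_variance(Fraction(n, M), draws_in_cell, Fraction(1, n_lam))

def efron_closed_form(n, M, n_lam, draws_in_cell):
    """(n / M) * W_lambda * (1 - 1 / n_lam), with W_lambda = draws_in_cell / (M * n_lam / n)."""
    w_lam = Fraction(draws_in_cell * n, M * n_lam)
    return Fraction(n, M) * w_lam * (1 - Fraction(1, n_lam))
```

Because `Fraction` arithmetic is exact, the two functions can be compared with `==` for any admissible integer inputs.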



Poisson. One can check that the weights defined by (61) with $M = N_n \sim
\mathcal{P}(\mu n)$ independent of the $(U_j)_{j \ge 1}$ are actually Poisson($\mu$) weights; this is
the classical poissonization trick [66, Chapter 3.5]. Moreover, conditionally on
$W_\lambda$ and $N_n = M$, the same reasoning as for Efron(M) (with a multiplicative
constant $\mu^{-1}$ instead of $n/M$) leads to (58). □






Proof of Proposition 2. From (50), (16) holds with 

n,px 

Combining Lemma 17 with Lemma 4 (for Efr and Rad), Lemma 5 (for Rho and
Loo) and Lemma 6 (for Poi), the following non-asymptotic bounds are obtained:

1. Efron (M„): let ki = 5.1 and K2 = 3.2, then 

{Bnpxf^'^) "'P^ npx 



2. Rademacher {p): 



2 



l-p 



{H2 - 1) A 



[nppx) , 



I Ai w —1p-Vnp\ 
> g^P"^^'^'^^P>> > (•53') 

~ ".PA ~ 1 — P 



4) A f ] > ^(p™Rad(l/2)) 

3. Poisson (/i): 



1 + 3 X 10-^ A ^ " > ^^P-"-"^v-;; > _l ^ (64) 



1 A ^ ' ^ > ^(P^"P°'(^)) > ^ - f e'^"^^ A 1 (65) 

{^J.npx-2)^~ n,px ~ npx V /xnp^<1.6i; ^ ^ 

4. Random hold-out ((/fn): on the one hand, 

^(pc„Rho(q„)) ^ n / + ^ - 1^ > 



n,p\ n — q \ U(n,npx,q„) J \ — B+ 

where the lower bounds assume that < £?_ < g„n^^ < < 00. On the 
other hand, under the same condition 



^(pcnRho(g„)) ^ L / in(npj^) 

"^PA -B-{l-B+)\] npx 

provided that npx > Lb_,b^- When g„ = [n/2j, this upper bound is 
combined with (21). 
5. Leave-one-out: 

npx>2 ^ ^(ponLoo) ^ ^ 

npx-1 ~ ".PA - "PA=r ^ ' 

□ 

Proof of Lemma 15. Lemma 15 is a byproduct of the proof of Proposition 2 
(combined with Lemma 14). □ 
8.10. Concentration inequalities

In this subsection, concentration inequalities are proved for the resampling
penalty (Proposition 3) and for $\bar\delta(m)$ with unbounded data (Lemma 12).

8.10.1. Proof of Proposition 3

According to (50), pen(m) is a U-statistic of order 2 conditionally on $(\mathbf{1}_{X_i \in I_\lambda})_{(i,\lambda)}$.
Then, [9, Lemma 5] with

^ _ Riw[n,px) + R2,w{n,px) ^ _ - iRi,w{n,p\) + R2,win,p\)) 
n{npx - 1) ri^Pxinpx - 1) 

implies that for every q>2 

||pen(m)-E^'"[pen(™)]||^''-^ <i,,,5,i?„V2^-i/2 

X sup {Ri^w{n,p) + R2,w{n,p)}q^''+^'&[p2{'m)]. 

np>Aji 

Conditional concentration inequalities follow from the classical link between 
moments and concentration [G, Lemma 8.10], with a probability bound 1 — 
n~'^ . Since 1 — n~'^ is deterministic, this implies unconditional concentration 
inequalities. 

The second statement follows from the proof of Proposition 2 where non- 
asymptotic upper bounds on 

2 + 5(P°"W) = Cm/ X (i?i,vv(n,pA) + i?2,vv(n,PA)) 

n,px 

can be found. □ 

8.10.2. Proof of Lemma 12 

From [6, Lemma 8.18] which is stated and proved in [8], 



with -.^ {Y - Srn{X)f - {Y - .s{X)f 

= {sm{X) - s{X)f - 2ea{X)MX) - s{X)). 

Note that ecr(X)(s,„(X) — s{X)) is centered conditionally on X E Ix for every 
A G A,„. Hence, 

P(m)||_^ < ^^^^ (l|s-s,„||^ + 2CT,nax||s-s„J^||e||J . (67) 
Using now assumptions (Ag^c) and (A(5), for every q >2, 
\\S{m)\\^ < 2^^ {{ci^J^e ( s, s„, ) + 2c% „^^e ( s, s,„ )F3^((z)a„,ax) ^ 
-1/2



fg, + l/2^maxV_Dri 



Taking $\theta = D_m^{-1/2}$, (38) follows from the classical link between moments and
concentration inequalities [6, Lemma 8.10]. For the second statement, start back
from (67) and use that $\|s - s_m\|_\infty \le 2A$. □



8.11. Expectations of inverses 

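This subsection is devoted to the proofs of the lemmas of Section 3.4.3. Note 
that [9, Section 2 of the Technical appendix] explains how to generalize (18) to a 
wide class of random variables. Two useful results can be found in [9, Technical 
appendix]: first, the general lower bound 

\[ e^+_Z \ge \mathbb{P}(Z > 0), \tag{68} \]

comes from Jensen's inequality. Second, defining 

\[ e^0_Z := \mathbb{E}[Z] \, \mathbb{E}\left[ Z^{-1} \mathbf{1}_{Z>0} \right] = e^+_Z \, \mathbb{P}(Z > 0), \tag{69} \]

the following upper bound holds as soon as $\mathbb{P}(c_Z > Z > 0) = 0$: 

\[ \forall a > 0, \quad e^0_Z = \mathbb{E}\left[ Z^{-1} \mathbf{1}_{a\mathbb{E}[Z] > Z > 0} \right] \mathbb{E}[Z] + \mathbb{E}\left[ Z^{-1} \mathbf{1}_{Z \ge a\mathbb{E}[Z]} \right] \mathbb{E}[Z] \le \mathbb{P}\left( a\mathbb{E}[Z] > Z > 0 \right) \mathbb{E}[Z] \, c_Z^{-1} + a^{-1}. \tag{70} \]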
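As a sanity check of (68) to (70), the following minimal Python sketch evaluates both quantities exactly on a small distribution. It assumes the reading $e^+_Z = \mathbb{E}[Z]\,\mathbb{E}[Z^{-1} \mid Z > 0]$ and $e^0_Z = \mathbb{E}[Z]\,\mathbb{E}[Z^{-1}\mathbf{1}_{Z>0}]$; the function names and the toy distribution are hypothetical and serve only as an illustration. 

```python
from fractions import Fraction


def e_zero(pmf):
    """e^0_Z = E[Z] * E[Z^{-1} 1_{Z>0}] for a finite pmf given as {value: probability}."""
    mean = sum(z * p for z, p in pmf.items())
    inv_on_positive = sum(p / z for z, p in pmf.items() if z > 0)
    return mean * inv_on_positive


def e_plus(pmf):
    """e^+_Z = E[Z] * E[Z^{-1} | Z>0], which equals e^0_Z / P(Z>0)."""
    p_positive = sum(p for z, p in pmf.items() if z > 0)
    return e_zero(pmf) / p_positive


def upper_bound_70(pmf, a, c_z):
    """Right-hand side of (70): P(a E[Z] > Z > 0) * E[Z] / c_Z + 1/a."""
    mean = sum(z * p for z, p in pmf.items())
    p_small = sum(p for z, p in pmf.items() if 0 < z < a * mean)
    return p_small * mean / c_z + 1 / a


# Toy integer-valued distribution (illustrative assumption), so that c_Z = 1 is admissible.
toy_pmf = {0: Fraction(1, 4), 1: Fraction(1, 4), 4: Fraction(1, 2)}
p_positive = sum(p for z, p in toy_pmf.items() if z > 0)

lower_bound_holds = e_plus(toy_pmf) >= p_positive                      # (68)
identity_holds = e_zero(toy_pmf) == e_plus(toy_pmf) * p_positive       # (69)
upper_bound_holds = e_zero(toy_pmf) <= upper_bound_70(toy_pmf, Fraction(1, 2), 1)  # (70)
```

Exact rational arithmetic is used so that the comparisons are not affected by rounding. 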



8.11.1. Binomial case (proof of (19) in Lemma 4) 

When $n \ge 9$, the upper bound follows from (69) together with Lemma 4.1 
of [38] (showing that $e^0_{\mathcal{B}(n,1/2)} \le 2n/(n+1)$). When $n \le 8$, 
$e^+_{\mathcal{B}(n,1/2)} \le 1.21$ (see for instance [6, Section 8.7]). For the 
lower bound, the crucial point is that $Z \sim \mathcal{B}(n, \frac{1}{2})$ is 
nonnegative and symmetric, that is, $\mathcal{D}(Z) = \mathcal{D}(n - Z)$. 
Using only this property and defining 
$p_0 = \mathbb{P}(Z = 0) = \mathbb{P}(Z = n) = 2^{-n}$, we have 



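\[ e^+_Z = \frac{1}{2} \, \mathbb{P}(Z = n \mid Z > 0) + \frac{1}{1 - p_0} \, \mathbb{E}\left[ \frac{n^2}{4 Z (n - Z)} \mathbf{1}_{0 < Z < n} \right]. \tag{71} \]

Since Z is binomial with parameters (n, 1/2), 

\[ \mathbb{E}\left[ \frac{n^2}{4 Z (n - Z)} \mathbf{1}_{0 < Z < n} \right] \ge \mathbb{P}(0 < Z < n) + \mathbb{P}(Z = 1 \text{ or } Z = n - 1) \, \frac{(n-2)^2}{4(n-1)} = 1 - 2 p_0 + \frac{n (n-2)^2}{2^{n+1} (n-1)} \]

if $n \ge 3$. Putting this into (71), we obtain: 

\[ e^+_{\mathcal{B}(n,1/2)} \ge 1 + \frac{2^{-n-1}}{1 - p_0} \left( \frac{n (n-2)^2}{n-1} - 1 \right) \ge 1. \]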



□ 
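The numerical claims of this proof can be checked by exact enumeration. The sketch below is an illustration only: it assumes $e^+_Z = \mathbb{E}[Z]\,\mathbb{E}[Z^{-1} \mid Z > 0]$ for $Z \sim \mathcal{B}(n, 1/2)$, and the function name and the range of n explored are arbitrary choices. 

```python
from fractions import Fraction
from math import comb


def e_plus_binomial_half(n):
    """Exact e^+_Z = E[Z] * E[Z^{-1} | Z > 0] for Z ~ Binomial(n, 1/2)."""
    pmf = {k: Fraction(comb(n, k), 2 ** n) for k in range(n + 1)}
    mean = Fraction(n, 2)
    inv_on_positive = sum(p / k for k, p in pmf.items() if k > 0)
    p_positive = 1 - pmf[0]
    return mean * inv_on_positive / p_positive


# Exact values for a range of sample sizes (the upper limit 40 is arbitrary).
values = {n: e_plus_binomial_half(n) for n in range(1, 41)}

# Lower bound of the proof: e^+ >= 1 as soon as n >= 3.
lower_ok = all(values[n] >= 1 for n in range(3, 41))

# Small-n upper bound quoted in the proof: e^+ <= 1.21 when n <= 8.
small_n_ok = all(values[n] <= Fraction(121, 100) for n in range(1, 9))

# Large-n upper bound: e^0 <= 2n/(n+1), divided by P(Z > 0) = 1 - 2^{-n}.
large_n_ok = all(
    values[n] <= Fraction(2 * n, n + 1) / (1 - Fraction(1, 2 ** n))
    for n in range(9, 41)
)
```

The cases n = 1 and n = 2 give values below 1 (1/2 and 5/6), which is why the lower bound is only claimed for $n \ge 3$. 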
8.11.2. Hypergeometric case (proof of Lemma 5) 

Let $Z \sim \mathcal{H}(n, r, q)$. It has expectation $\mathbb{E}[Z] = (qr)/n$. 

General lower bound  It follows from (68), 

\[ \mathbb{P}(Z = 0) \le \left( 1 - \frac{q}{n} \right)^r \le \exp\left( -\frac{qr}{n} \right) \]

and the fact that if $r \ge n - q + 1$, $\mathbb{P}(Z > 0) = 1$. 

A general upper bound  According to (69) and the lower bound for $\mathbb{P}(Z > 0)$ 
above, an upper bound on $e^+_{\mathcal{H}(n,r,q)}$ can be derived from an upper 
bound on $e^0_{\mathcal{H}(n,r,q)}$. 



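Recall the following concentration result by Hush and Scovel [41]: for 
every $x \ge 2$, 

\[ \mathbb{P}\left( \mathbb{E}(Z) - Z > x \right) \le \exp\left( -2 (x - 1)^2 \left[ \left( \frac{1}{r+1} + \frac{1}{n - r + 1} \right) \vee \left( \frac{1}{q+1} + \frac{1}{n - q + 1} \right) \right] \right). \]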



Combined with the above concentration inequality, (70) with cz = 1, "&[Z] 
qrn~^ and a ~ 1 ~ ^ for any ;^ > /3 > ^ yields 



^ V 

< — exp 



2(/3r- 1)^ 
r+1 



Therefore, 



l-exp(-f) 



(72) 



holds for every n > r,q > 1 . 



End of the proof of (20) With the additional conditions on n, r and q, (3 
can be taken equal to ^^"^^ in(r)(7+i) .^^ ^.^^^ ^j^^^^ 



< 



1 



l+^jln(r)(r+l) 



n , , /ln(r) 
< 1 + -A'(e)i' ^ ' 



with A(e) 



1 



2yin(2) 



ln(3) 



3 



Using (69) and the upper bound on P (Z = 0), (20) follows since r > 2 and 
K3(e) = 0.9 + 1.4 X > 1.02 x A(e) + 0.03. 
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"Rho" case Assume now that q ~ 1^1 so that - — 2 + ^^-r < 3 and tends 
to 2 when n tends to infinity. 
For r > 6, /? = f in (72) yields 

4(„.6,,) < 9-68 < 7.61 e+(„.8,,) < 7.46 e+(„_g_^) < 7.32 

For r > 10, /3 = i + i in (72) yields 

sup ei,, N < 7.49 sup et,, < 3. 

r>10 r>26 



Small values of r must be treated apart. For r = 1, it is easy to compute 

r, we have < 

i! ^ ('•+1)! 



ei, , , ^ qn ^ < 1. When n = r. we have et,, s = 1. Otherwise, usine the 



fact that for every n> r + 1, , "• „ > ; , A' l 



^ ( 



^■H{n,r,q) — ^ j-^, 

with i? = ^ G [1; +oo). For r = 2, this upper bound is lower than 1.6. If ^ < 3 
(which holds in the "Rho" case), 

4(„,3,,) < 4-67 e+(„ < 8.15 e+(„^,^^) < 14.29. 

"Loo" case Assume now q = n—l. On the one hand, if r = 1, the conditioning 
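makes Z deterministic and equal to 1, so that 

\[ e^+_{\mathcal{H}(n,1,n-1)} = \mathbb{E}[Z] = 1 - \frac{1}{n}. \]

On the other hand, if $r \ge 2$, $Z > 0$ holds a.s. since it only takes two values: 

\[ \mathbb{P}(Z = r - 1) = \frac{r}{n} \quad \text{and} \quad \mathbb{P}(Z = r) = \frac{n - r}{n}. \]

Hence, 

\[ e^+_{\mathcal{H}(n,r,n-1)} = \frac{(n-1) r}{n} \left( \frac{r}{(r-1) n} + \frac{n - r}{n r} \right) = 1 + \frac{1}{n} \left( \frac{(n-1) r}{n (r-1)} - 1 \right). \]

The lower bound is straightforward since $n \ge r$. 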
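The closed form for the leave-one-out case can be compared with a direct enumeration of the hypergeometric distribution. In the sketch below, $\mathcal{H}(n, r, q)$ is taken to be the number of marked items among r draws without replacement from n items, q of which are marked; this parametrization, the function name and the range of n are assumptions made for illustration. 

```python
from fractions import Fraction
from math import comb


def e_plus_hypergeom(n, r, q):
    """Exact e^+_Z = E[Z] * E[Z^{-1} | Z > 0] for Z ~ H(n, r, q).

    Z counts marked items among r draws without replacement
    from n items, q of which are marked (assumed parametrization).
    """
    total = comb(n, r)
    support = range(max(0, r + q - n), min(r, q) + 1)
    pmf = {k: Fraction(comb(q, k) * comb(n - q, r - k), total) for k in support}
    mean = sum(k * p for k, p in pmf.items())
    inv_on_positive = sum(p / k for k, p in pmf.items() if k > 0)
    p_positive = sum(p for k, p in pmf.items() if k > 0)
    return mean * inv_on_positive / p_positive


def loo_closed_form(n, r):
    """Closed form for q = n - 1 and r >= 2: 1 + (1/n) * ((n-1) r / (n (r-1)) - 1)."""
    return 1 + Fraction(1, n) * (Fraction((n - 1) * r, n * (r - 1)) - 1)


# Compare enumeration and closed form on a small grid (limits are arbitrary).
loo_matches = all(
    e_plus_hypergeom(n, r, n - 1) == loo_closed_form(n, r)
    for n in range(3, 15)
    for r in range(2, n + 1)
)
```

When r = n the two sides both equal 1, and they exceed 1 whenever $2 \le r < n$. 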
"Lpo" case As noticed in Lemma 17, 

> 1- 

Moreover, when $r \ge p + 1$ the support of $\mathcal{H}(n, r, n - p)$ is $\{r - p, \ldots, r\}$ and 



e 



(n -p)r ^ (p 



n{7i,r,n-p) ^ 5^ 



j—r—p ^ \n—p) 






fc=(p+r-n)VO ^ '^P^ 

More precisely, the k-th term of the sum is equal to 
p)r (D(plD ^ (i _ r^y-'' fp\ r 



(n - 

n 



\ nJ \kj r ^ p n ■ ■ ■ {n — p + 1)' 
so that 



< 



H{n,r,n-p) - {j - p)n ■ ■ ■ {n - p + I)' 

The result follows. □ 
Remark 11 ( Asymptotics) . If for some a > 0, qkr]/"^ ""-fe^ * ^^.'^ 

A: — ^+oo 

nk > Tk ^ +00, then et., „ ^ ^ 1 when k — > oo. The upper bound is 
obtained by taking 

l+./(r,. + l)ln(^) 
(3 = i — 

Tk 

in (72), which is possible for k sufficiently large. The lower bound is straight- 
forward. 

8.11.3. Poisson case (proof of Lemma 6) 

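Let $Z \sim \mathcal{P}(\mu)$ and define $g : [0, \infty) \to \mathbb{R}$ by 
$g(0) = 0$ and for every $\mu > 0$, 

\[ g(\mu) := e^+_{\mathcal{P}(\mu)} = \mu \, \mathbb{E}\left[ Z^{-1} \mid Z > 0 \right] = \frac{\mu}{1 - e^{-\mu}} \sum_{k \ge 1} \frac{e^{-\mu} \mu^k}{k \times k!} = \frac{\mu}{e^{\mu} - 1} \int_0^{\mu} \frac{e^t - 1}{t} \, dt. \]

The function g is continuous at 0 and has a first derivative $g'(0) = 1$. For every 
$x \ge 0$, define 

\[ h(x) = \frac{e^x - 1}{x}, \qquad H(x) = \int_0^x h(t) \, dt, \qquad a(x) = \frac{h'(x)}{h(x)} = 1 - \frac{e^x - 1 - x}{x (e^x - 1)}, \]

where the last equality holds if $x > 0$ and $a(0) = 1/2$. Then, $g(u) = H(u)/h(u)$ 
satisfies the following ordinary differential equation: 

\[ g(0) = 0, \qquad \forall u \ge 0, \quad g'(u) = 1 - a(u) g(u). \]

Since 

\[ \forall u \ge 0, \quad \frac{1}{2} \le a(u) \le 1 \quad \text{and} \quad \lim_{u \to +\infty} a(u) = 1, \]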

g satisfies a differential inequality 

\[ 1 - \frac{g}{2} \ge g' \ge 1 - g, \qquad g(0) = 0. \]

Then, for every $x \ge x_0 \ge 0$, 

\[ 2 + (g(x_0) - 2) e^{-(x - x_0)/2} \ge g(x) \ge 1 + (g(x_0) - 1) e^{-(x - x_0)}. \tag{73} \]
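The integration step leading to (73) can be spelled out as follows. This is a sketch that uses only the two-sided differential inequality $1 - g \le g' \le 1 - g/2$; it is not taken from the original text. 

```latex
% Multiply by integrating factors and use the sign of the derivative.
\[
\frac{d}{dx}\Bigl[ e^{x} \bigl( g(x) - 1 \bigr) \Bigr]
  = e^{x} \bigl( g'(x) + g(x) - 1 \bigr) \ge 0,
\qquad
\frac{d}{dx}\Bigl[ e^{x/2} \bigl( g(x) - 2 \bigr) \Bigr]
  = e^{x/2} \Bigl( g'(x) + \tfrac{1}{2} g(x) - 1 \Bigr) \le 0 .
\]
% Hence the first bracket is nondecreasing and the second is nonincreasing:
\[
e^{x} \bigl( g(x) - 1 \bigr) \ge e^{x_0} \bigl( g(x_0) - 1 \bigr),
\qquad
e^{x/2} \bigl( g(x) - 2 \bigr) \le e^{x_0/2} \bigl( g(x_0) - 2 \bigr),
\qquad x \ge x_0 \ge 0 .
\]
% Dividing by the exponential factors gives both sides of (73).
```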
Lower bound  The general lower bound (68) gives 

\[ g(\mu) \ge \mathbb{P}(Z > 0) = 1 - e^{-\mu}, \]

which can be improved. Indeed, if $g(x_0) \ge 1$, (73) shows that $g(x) \ge 1$ for every 
$x \ge x_0$. Since $g = H/h$ and for every $u \ge 0$, 

\[ H(u) \ge u + \frac{u^2}{4} + \frac{u^3}{18}, \quad \text{it follows that} \quad g(u) \ge \frac{u \left( u + \frac{u^2}{4} + \frac{u^3}{18} \right)}{e^u - 1}. \]

Then, $g(1.61) \ge 1$, so that $g(x) \ge 1$ for every $x \ge 1.61$. 

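Upper bound  Using (73) with $x_0 = 0$ gives 

\[ \forall x \ge 0, \quad g(x) \le 2 - 2 e^{-x/2} \le 2. \]

Moreover, for every $\varepsilon \in (0; 1)$, $1 - \varepsilon \le a(x) \le 1$ as soon as 
$x \ge \varepsilon^{-1}$. Then, on $[\varepsilon^{-1}; \infty)$, g satisfies the 
differential inequality 

\[ g' \le 1 - (1 - \varepsilon) g. \]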

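Integrating this between $\varepsilon^{-1}$ and $2\varepsilon^{-1}$, 

\[ g(2\varepsilon^{-1}) \le (1 - \varepsilon)^{-1} \left[ 1 + \left( (1 - \varepsilon) g(\varepsilon^{-1}) - 1 \right) \exp\left( -\varepsilon^{-1} (1 - \varepsilon) \right) \right]. \]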

For every $x > 2$, $\varepsilon = 2x^{-1} \in (0; 1)$ so that 

,(.)<l+^+^^-^^^^P(-^^l+^(^ + --^). 
- x-2 ~ x-2 

The result follows. □ 
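The bounds on $g(\mu) = e^+_{\mathcal{P}(\mu)}$ obtained above can be checked numerically. The sketch below evaluates g through the series $\sum_{k \ge 1} \mu^k / (k \cdot k!)$; the function name, the truncation level and the grid of values of $\mu$ are illustrative assumptions, and the comparisons are made in floating point rather than exactly. 

```python
import math


def poisson_e_plus(mu, terms=400):
    """g(mu) = mu * E[1/Z | Z > 0] for Z ~ Poisson(mu).

    Uses g(mu) = mu / (e^mu - 1) * sum_{k>=1} mu^k / (k * k!),
    truncated after `terms` terms (assumed ample for moderate mu).
    """
    if mu == 0:
        return 0.0
    series = 0.0
    term = 1.0  # equals mu^k / k! after the k-th update
    for k in range(1, terms + 1):
        term *= mu / k
        series += term / k
    return mu * series / math.expm1(mu)


# Grid of Poisson parameters (arbitrary choice for the illustration).
grid = [0.1, 0.5, 1.0, 1.61, 2.0, 5.0, 10.0, 30.0]

# General lower bound: g(mu) >= 1 - exp(-mu).
lower_ok = all(poisson_e_plus(mu) >= 1.0 - math.exp(-mu) for mu in grid)

# Upper bound from (73) with x_0 = 0: g(mu) <= 2 - 2 exp(-mu / 2).
upper_ok = all(poisson_e_plus(mu) <= 2.0 - 2.0 * math.exp(-mu / 2.0) for mu in grid)

# Improved lower bound: g(mu) >= 1 for every mu >= 1.61.
improved_ok = all(poisson_e_plus(mu) >= 1.0 for mu in grid if mu >= 1.61)
```

At $\mu = 1$ the value of g is below 1, so a threshold such as 1.61 is genuinely needed for the improved lower bound. 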
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